Abstract


A  modern  practitioner  of  machine  learning  must  often  consider  trade-offs  between  accuracy  and 
complexity  when  selecting  from  available  machine  learning  algorithms.  Prediction  tasks  can  range 
from  requiring  real-time  performance  to  being  largely  unconstrained  in  their  use  of  computational 
resources.  In  each  setting,  an  ideal  algorithm  utilizes  as  much  of  the  available  computation  as 
possible  to  provide  the  most  accurate  result. 

This  issue  is  further  complicated  by  applications  where  the  computational  constraints  are  not  fixed 
in advance. In many applications predictions are often needed in time to allow for adaptive
behaviors which respond to real-time events. Such constraints often rely on a number of factors at
prediction time, making it difficult to select a fixed prediction algorithm a priori. In these
situations, an ideal approach is to use an anytime prediction algorithm. Such an algorithm rapidly
produces  an  initial  prediction  and  then  continues  to  refine  the  result  as  time  allows,  producing  final 
results  which  dynamically  improve  to  fit  any  computational  budget. 

Our  approach  uses  a  greedy,  cost-aware  extension  of  boosting  which  fuses  the  disparate  areas  of 
functional  gradient  descent  and  greedy  sparse  approximation  algorithms.  By  using  a  cost-greedy 
selection procedure our algorithms provide an intuitive and effective way to trade off
computational cost and accuracy for any computational budget. This approach learns a sequence of
predictors to apply as time progresses, using each new result to update and improve the current prediction
as time allows. Furthermore, we present theoretical work in the different areas we have brought
together, and show that our anytime approach is guaranteed to achieve near-optimal performance
with respect to unknown prediction time budgets. We also apply our algorithms to a number of
problem domains, such as classification and object detection, and the results indicate
that our approach to anytime prediction is more efficient than adapting a number of existing
methods to the anytime prediction problem.

We  also  present  a  number  of  contributions  in  areas  related  to  our  primary  focus.  In  the  functional 
gradient  descent  domain,  we  present  convergence  results  for  smooth  objectives,  and  show  that  for 
non-smooth  objectives  the  widely  used  approach  fails  both  in  theory  and  in  practice.  To  rectify 
this  we  present  new  algorithms  and  corresponding  convergence  results  for  this  domain.  We  also 
present  novel,  time-based  versions  of  a  number  of  greedy  feature  selection  algorithms  and  give 
corresponding  approximation  guarantees  for  the  performance  of  these  algorithms. 
Chapter 1
Introduction 


When  analyzing  the  performance  of  any  machine  learning  approach,  there  are  often  two  critical 
factors  considered:  the  predictive  accuracy  of  the  algorithm  and  the  cost  or  strain  on  resources  of 
using  a  given  algorithm.  Furthermore,  these  two  metrics  of  accuracy  and  cost  are  typically  opposed 
to  each  other.  Increasing  the  accuracy  of  an  algorithm  often  requires  increasing  the  complexity  of 
the  underlying  model,  which  comes  with  an  increase  in  cost,  and  vice  versa.  This  trade-off  between 
cost  and  accuracy  is  an  inherently  difficult  problem  and  is  the  focus  of  this  work. 


1.1  Motivation 

The number of machine learning applications which involve real-time and latency-sensitive
predictions is growing rapidly. In areas such as robotics, decisions must be made on the fly and in
time to allow for adaptive behaviors which respond to real-time events. In computer vision,
prediction algorithms must often keep up with high-resolution streams of live video from multiple
sources without sacrificing accuracy. Finally, prediction tasks in web applications must be carried
out in response to incoming data or user input without significantly increasing latency, and the
computational costs associated with hosting a service are often critical to its viability. For such
applications, the decision to use a larger, more complex predictor with higher accuracy or a less
accurate but significantly faster predictor can be difficult.

To  this  end,  we  will  focus  on  the  prediction  or  test-time  cost  of  a  model  in  this  work,  and  the 
problem of trading off prediction cost against accuracy. While the cost of building a model
is  an  important  consideration,  the  advent  of  cloud  computing  and  a  large  increase  in  the  general 
computing  power  available  means  that  the  resources  available  at  training  time  are  often  much  less 
constrained  than  the  prediction  time  requirements.  When  balancing  training  costs,  concerns  such 
as  scalability  and  tractability  are  often  more  important,  as  opposed  to  factors  such  as  latency  which 
are  more  directly  related  to  the  complexity  of  the  model. 

The problem of trading off prediction cost and accuracy is considered throughout the
literature, both explicitly and implicitly. Implicitly, reduced model complexity, in the form of reduced
memory or computational requirements, is a feature often used to justify reduced accuracy when
comparing  to  previous  work.  In  other  settings,  entirely  new  algorithms  are  developed  when  models 
are  too  costly  for  a  given  application. 

Explicitly, there is a wide array of approaches for generating models of various complexity
and  comparing  their  predictive  performance.  We  will  now  discuss  some  of  the  existing  approaches 
to  this  problem. 

Varying  Model  Complexity  Directly 

In many settings, model complexity can be tuned directly through the parameters of the model.
By tuning these parameters and comparing performance, the cost and accuracy trade-off is exposed
directly, and the practitioner is able to choose from among these points in an ad hoc way, typically
selecting the highest accuracy model which fits within their computational constraints. This approach is closest to the
implicit  approach  of  discussing  cost  and  accuracy  trade-offs,  where  the  trade-off  is  considered 
external  to  the  learning  problem  itself. 

Examples  include: 

•  Tuning  the  number  and  structure  of  hidden  units  in  a  neural  network. 

•  Tuning  the  number  of  exemplars  used  in  an  exemplar-based  method. 

•  Tuning  the  number  of  weak  learners  used  in  an  ensemble  method. 

•  Manually  selecting  different  sets  of  features  or  feature  transformations  to  train  the  model  on. 

Constraining  the  Model 

Related to the previous approach, another way to trade off cost and accuracy is to specify some
kind of constraint on the cost of the model. In contrast to the previous approach, the constraint is
usually  made  explicit  at  training  time,  and  the  learning  algorithm  optimizes  the  accuracy  directly 
with  knowledge  of  the  constraint. 

Under  this  regime,  the  constraint  can  be  related  to  the  complexity  of  the  model,  but  is  often 
more directly related to the prediction cost of interest. Since these constraints are not directly
related to the complexity of the model, they often require new algorithms and methods for optimizing
the  model  subject  to  such  a  constraint. 

Examples  of  this  method  include: 

• Assuming that each feature used by a model has some computational cost and using some
kind of budgeted feature selection approach. In this setting, a model can have a high
complexity and still have low cost, as the feature computation time is the dominant factor.

• Using a dimensionality reduction or sparse coding technique to otherwise reduce the
dimensionality of inputs and eventual computational cost.
Regularizing the Model

Under  this  approach,  one  augments  the  accuracy  objective  being  optimized  with  a  cost  term,  and 
then  optimizes  the  model  to  fit  this  single  combined  objective.  By  adjusting  the  importance  of  the 
cost  and  accuracy  factors  in  the  objective,  the  model  will  select  some  specific  point  on  the  spectrum 
of  possible  trade-offs. 

While  regularization  is  often  used  to  reduce  the  complexity  of  a  model  to  improve  accuracy, 
e.g.  eliminating  error  due  to  overfitting,  it  can  also  be  used  to  reduce  the  complexity  of  the  model 
to  handle  a  scarcity  of  prediction  time  resources. 

Examples  include: 

• Using a regularized version of the constraint used in the budgeted feature selection problem
and optimizing the model using this regularizer.

• Using a heavily regularized objective to increase the sparsity of a model and hence reduce the
processor and memory usage of that model.

Generating  Approximate  Predictions 

One final approach that is substantially different from the previous ones is to use some fixed,
potentially expensive, model, but reduce the cost of obtaining predictions from that model directly.
This  can  be  achieved  by  computing  approximations  of  the  predictions  the  model  would  generate 
if  given  increased  computational  resources.  This  can  either  be  done  by  generating  an  approximate 
version  of  the  model  for  use  in  prediction,  or  using  some  faster  but  less  accurate  algorithm  for 
prediction. 

Examples  include: 

•  Using  an  approximate  inference  technique  in  a  graphical  model. 

• Searching only a portion of the available space in a search-based or exemplar-based method.

• Using techniques such as cascades or early exits to improve the computation time of
ensemble predictions.

Almost all the techniques described above, however, are best utilized when the computational
constraints are well understood at training time. With the exception of a few algorithms which fall
into the approximate prediction category, each method requires the practitioner to make decisions at
training time which will affect the resulting trade-off of cost and accuracy made by the model. This
is a significant drawback, as it requires the practitioner to understand the cost-accuracy trade-off
a priori.

In  practice,  making  this  trade-off  at  training  time  can  have  adverse  effects  on  the  accuracy 
of  future  predictions.  In  many  settings,  such  as  cloud  computing,  the  available  computational 
resources  may  change  significantly  over  time.  For  instance,  prices  for  provisioning  machines  may 
vary significantly depending on the time of day, or idle machines may be utilized in an
on-demand manner to improve predictions.

In  other  settings,  the  resources  may  change  due  to  the  nature  of  the  problem  or  environment. 
For  example,  an  autonomous  agent  may  want  to  use  a  reasonably  fast  method  for  predicting  object 
locations  as  a  first  attempt  when  performing  some  task,  but  would  like  a  slower,  more  accurate 
method  of  last  resort  should  the  first  attempt  fail. 

As  a  final  example,  consider  the  problem  of  generating  batch  predictions  on  a  large  set  of  test 
examples.  For  example,  in  the  object  detection  domain  a  large  number  of  examples  are  generated 
from  a  single  input  image,  corresponding  to  each  of  the  locations  in  the  image.  Some  of  these 
examples,  such  as  cluttered  areas  of  the  image,  may  be  inherently  more  difficult  to  classify  than 
other  examples,  such  as  a  patch  of  open  sky.  In  this  setting  our  computational  constraint  is  actually 
on  the  prediction  cost  of  the  batch  of  examples  as  a  whole  and  not  on  each  single  example.  To 
obtain  high  accuracy  at  low  cost,  any  prediction  method  would  ideally  focus  its  efforts  on  the  more 
difficult  examples  in  the  batch.  Using  any  one  of  the  fixed  methods  above,  we  would  spend  the 
same  amount  of  time  on  each  example  in  the  batch,  resulting  in  less  efficient  use  of  resources. 


1.2  Approach 

To  handle  many  of  the  failure  situations  described  above,  it  would  be  useful  to  work  with  prediction 
algorithms  capable  of  initially  giving  crude  but  rapid  estimates  and  then  refining  the  results  as  time 
allows. For situations where the computational resources are not known a priori, or where we would
like to dynamically adapt the resources used, such an algorithm can automatically adjust to fill any
allocated budget at test time.

For  example,  in  a  robotics  application  such  as  autonomous  navigation,  it  may  sometimes  be 
the  case  that  the  robot  can  rapidly  respond  with  predictions  about  nearby  obstacles,  but  can  spend 
more  time  reasoning  about  distant  ones  to  generate  more  accurate  predictions. 

As first studied by Zilberstein [1996], anytime algorithms exhibit exactly this property of
providing increasingly better results given more computation time. Through this previous work,
Zilberstein has identified a number of desirable properties for an anytime algorithm to possess. In
terms of predictions these properties are:

• Interruptibility: a prediction can be generated at any time.

•  Monotonicity:  prediction  quality  is  non-decreasing  over  time. 

•  Diminishing  Returns:  prediction  quality  improves  fastest  at  early  stages. 

An algorithm meeting these specifications will be able to dynamically adjust its predictions to fit
within any test-time budget, avoiding the need to reason about computational constraints at
training time.
Thesis Statement: As it is often difficult to make decisions which trade off final cost
and accuracy a priori, we should instead seek to train predictors which dynamically
adjust the computations performed at prediction time to meet any computational
budget. The algorithms for generating these predictors should further be able to reason
about the cost and benefit of each element of a prediction computation to
automatically select those which improve accuracy most efficiently, and should do so without
knowing the test-time budget a priori.

Our work targets these specific properties using a hybrid approach. To obtain the incremental,
interruptible behavior we would like for updating predictions over time, we will learn an additive
ensemble of weaker predictors. This work builds on previous work in the area of boosted
ensemble learning [Schapire, 2002], specifically the functional gradient descent approach [Mason
et al., 1999, Friedman, 2000] for generalizing the behavior of boosting to arbitrary objectives and
weak predictors. In these approaches we learn a predictor which is simply the linear combination of
a sequence of weak predictors. This final predictor can easily be made interruptible by evaluating
the weak predictors in sequence and computing the linear combination of the outputs whenever a
prediction is desired.
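The interruptible evaluation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the representation of a weak predictor as a (weight, function) pair, the per-predictor cost list, and the names `anytime_predictions` and `predict_with_budget` are hypothetical and are not taken from the original work.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

# Assumed representation: each weak predictor is a (weight, function) pair.
WeakPredictor = Tuple[float, Callable[[Sequence[float]], float]]


def anytime_predictions(ensemble: List[WeakPredictor],
                        x: Sequence[float]) -> Iterator[float]:
    """Yield the running linear combination after each weak predictor.

    The caller may stop iterating at any point and keep the last value,
    which is what makes the prediction interruptible.
    """
    total = 0.0
    for weight, h in ensemble:
        total += weight * h(x)
        yield total


def predict_with_budget(ensemble: List[WeakPredictor],
                        costs: List[float],
                        x: Sequence[float],
                        budget: float) -> float:
    """Evaluate weak predictors in order until the next one exceeds the budget."""
    spent = 0.0
    prediction = 0.0
    for (weight, h), cost in zip(ensemble, costs):
        if spent + cost > budget:
            break
        prediction += weight * h(x)
        spent += cost
    return prediction
```

The same learned sequence serves every budget: a larger budget simply consumes a longer prefix of the ensemble.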

We will augment the standard functional gradient approach with ideas taken from greedy
selection algorithms [Tropp, 2004, Streeter and Golovin, 2008, Das and Kempe, 2011] typically used
in the submodular optimization and sparse approximation domains. We will use a cost-greedy
version of functional gradient methods which selects the next weak predictor based on its improvement
in accuracy scaled by the cost of the weak learner. This cost-greedy approach ensures that we
select sequences of weak predictors that increase accuracy as efficiently as possible, satisfying the
last two properties.
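As a rough illustration of selecting by improvement per unit cost, the sketch below builds an ensemble for the squared loss from a fixed pool of candidate weak predictors. It is a simplified stand-in and not the algorithm developed in this work: the squared loss, the precomputed training outputs, the exact line search, and the name `cost_greedy_ensemble` are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def cost_greedy_ensemble(outputs: Dict[str, List[float]],
                         costs: Dict[str, float],
                         target: List[float],
                         rounds: int) -> Tuple[List[Tuple[str, float]], List[float]]:
    """Greedily add the weak predictor with the best loss reduction per unit cost.

    outputs[name] holds the (assumed precomputed) outputs of weak predictor
    `name` on the training examples; costs[name] is its prediction cost.
    """
    pred = [0.0] * len(target)
    sequence: List[Tuple[str, float]] = []
    for _ in range(rounds):
        residual = [t - p for t, p in zip(target, pred)]
        best = None  # (score, name, step)
        for name, h in outputs.items():
            norm_sq = sum(v * v for v in h)
            if norm_sq == 0.0:
                continue
            corr = sum(r * v for r, v in zip(residual, h))
            # Squared-loss reduction achieved by the optimal step along h.
            gain = corr * corr / norm_sq
            score = gain / costs[name]
            if best is None or score > best[0]:
                best = (score, name, corr / norm_sq)
        if best is None or best[0] <= 1e-12:
            break  # no candidate improves the loss any further
        _, name, step = best
        pred = [p + step * v for p, v in zip(pred, outputs[name])]
        sequence.append((name, step))
    return sequence, pred
```

With equal costs this reduces to ordinary greedy selection on the residual; unequal costs push cheap but useful predictors to the front of the sequence.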

As  we  will  show  later,  this  relatively  simple  framework  can  be  applied  to  a  wide  range  of 
problems.  Furthermore,  we  provide  theoretical  results  that  show  that  this  method  is  guaranteed 
to  achieve  near-optimal  performance  as  the  budget  increases,  without  knowing  the  specific  budget 
a priori, and observe that this near-optimality holds in a number of experimental applications.


1.3  Related  Work 

We  now  detail  a  number  of  approaches  and  previous  work  that  are  related  both  to  the  focus  of  this 
work,  and  the  disparate  areas  we  fuse  together  in  our  methods  and  analysis. 

Boosting  and  Functional  Gradient  Methods 

Boosting  is  a  versatile  meta-algorithm  for  combining  together  multiple  simple  hypotheses,  or  weak 
predictors,  to  form  a  single  complex  hypothesis  with  superior  performance.  The  power  of  this 
meta-algorithm  lies  in  its  ability  to  craft  hypotheses  which  can  achieve  arbitrary  performance  on 
training data using only weak learners that perform marginally better than random. Schapire [2002]
gives a very good overview of general boosting techniques and applications.

To date, much of the work on boosting has focused on optimizing the performance of this
meta-algorithm with respect to specific loss functions and problem settings. The AdaBoost
algorithm [Freund and Schapire, 1997] is perhaps the most well known and most successful of these.
AdaBoost focuses specifically on the task of classification via the minimization of the
exponential loss by boosting weak binary classifiers together, and can be shown to be near optimal in this
setting. Looking to extend upon the success of AdaBoost, related algorithms have been developed
for other domains, such as RankBoost [Freund et al., 2003] and multiclass extensions to AdaBoost
[Mukherjee and Schapire, 2010]. Each of these algorithms provides both strong theoretical and
experimental results for its specific domain, including corresponding weak-to-strong learning
guarantees, but extending boosting to these and other new settings is non-trivial.

Recent  attempts  have  been  successful  at  generalizing  the  boosting  approach  to  certain  broader 
classes  of  problems,  but  their  focus  is  also  relatively  restricted.  Mukherjee  and  Schapire  [2010] 
present  a  general  theory  of  boosting  for  multiclass  classification  problems,  but  their  analysis  is 
restricted  to  the  multiclass  setting.  Zheng  et  al.  [2007]  give  a  boosting  method  which  utilizes  the 
second-order Taylor approximation of the objective to optimize smooth, convex losses.
Unfortunately, the corresponding convergence result for their algorithm does not exhibit the typical weak
to  strong  guarantee  seen  in  boosting  analyses  and  their  results  apply  only  to  weak  learners  which 
solve  the  weighted  squared  regression  problem. 

Other  previous  work  on  providing  general  algorithms  for  boosting  has  shown  that  an  intuitive 
link  between  algorithms  like  AdaBoost  and  gradient  descent  exists  [Mason  et  al.,  1999,  Friedman, 
2000],  and  that  many  existing  boosting  algorithms  can  be  reformulated  to  fit  within  this  gradient 
boosting  framework.  Under  this  view,  boosting  algorithms  are  seen  as  performing  a  modified 
gradient  descent  through  the  space  of  all  hypotheses,  where  the  gradient  is  calculated  and  then 
used  to  find  the  weak  hypothesis  which  will  provide  the  best  descent  direction. 
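A minimal sketch of this view is given below, assuming a logistic loss, labels in {-1, +1}, a finite pool of candidate hypotheses, and a fixed step size. These choices and the names `logistic_loss_grad` and `functional_gradient_boost` are illustrative assumptions, not the specific algorithms of the cited papers.

```python
import math
from typing import Callable, List, Sequence, Tuple

Hypothesis = Callable[[float], float]


def logistic_loss_grad(score: float, label: int) -> float:
    """Derivative of log(1 + exp(-label * score)) with respect to the score."""
    return -label / (1.0 + math.exp(label * score))


def functional_gradient_boost(X: Sequence[float],
                              y: Sequence[int],
                              hypotheses: List[Hypothesis],
                              rounds: int,
                              step: float = 0.5
                              ) -> Tuple[List[Tuple[float, Hypothesis]], List[float]]:
    """Repeatedly add the hypothesis best aligned with the negative gradient."""
    scores = [0.0] * len(X)
    ensemble: List[Tuple[float, Hypothesis]] = []
    for _ in range(rounds):
        # Negative functional gradient evaluated at each training point.
        neg_grad = [-logistic_loss_grad(s, t) for s, t in zip(scores, y)]

        def alignment(h: Hypothesis) -> float:
            out = [h(x) for x in X]
            norm = math.sqrt(sum(o * o for o in out))
            if norm == 0.0:
                return float("-inf")
            return sum(g * o for g, o in zip(neg_grad, out)) / norm

        best = max(hypotheses, key=alignment)  # best descent direction
        ensemble.append((step, best))
        scores = [s + step * best(x) for s, x in zip(scores, X)]
    return ensemble, scores
```

In practice the maximization over a finite pool is replaced by training a weak learner on the gradient targets, and the step size is chosen by a line search or a schedule.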

In the case of smooth convex functionals, Mason et al. [1999] give a proof of eventual
convergence for the functional gradient method. This result is similar to the classical convergence result
given in Zoutendijk’s Theorem [Zoutendijk, 1970], which guarantees convergence for a variety of
descent-based optimization algorithms, as long as the search direction at every iteration is
sufficiently close to the gradient of the function. Additionally, convergence rates of these algorithms
have been analyzed for the case of smooth convex functionals [Rätsch et al., 2002] and for
specific potential functions used in classification [Duffy and Helmbold, 2000] under the traditional
PAC weak learning setting. Our result [Grubb and Bagnell, 2011] extends these results for smooth
losses, and also introduces new results and algorithms for non-smooth losses.

Submodular  Maximization 

In  our  analysis  of  the  cost-greedy  approaches  in  this  document,  we  will  make  heavy  use  of  the 
submodular  set  function  maximization  framework.  A  good  overview  of  this  domain  is  given  in  the 
survey of submodular function maximization work by Krause and Golovin [2012].

Most relevant to our work are the approaches for the budgeted or knapsack-constrained
submodular maximization problem. In this setting each element is assigned a cost and the constraint
is  on  the  sum  of  costs  of  elements  in  the  selected  set,  similar  to  the  knapsack  problem  [Mathews, 
1897],  which  is  the  modular  complement  to  this  setting. 

In this domain the original greedy algorithm for submodular function maximization with a
cardinality constraint [Nemhauser et al., 1978] can be extended using a cost-greedy approach [Khuller
et al., 1999].

A number of results from previous work [Khuller et al., 1999, Krause and Guestrin, 2005,
Leskovec et al., 2007, Lin and Bilmes, 2010] have given variations on the cost-greedy algorithm
which do have approximation bounds with factors of $\frac{1}{2}(1 - \frac{1}{e})$ and $(1 - \frac{1}{e})$. Unfortunately, these
algorithms all require a priori knowledge of the budget in order to achieve the approximation bounds,
and cannot generate single, budget-agnostic sequences with approximation bounds for all budgets.

Unfortunately, as Khuller et al. [1999] show, the standard approximation results do not hold
directly for the cost-greedy algorithm. As we show in Chapter 4, in general there is no single
budget-agnostic algorithm which can achieve a similar approximation guarantee, but we can achieve
approximation guarantees for certain budgets. Most directly related to our work, we will build on
the cost-greedy analysis of Streeter and Golovin [2008], which gives an approximation guarantee
for certain budgets dependent on the problem.
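To make the difficulty concrete, the sketch below runs a plain cost-greedy rule on a small set-coverage problem, a standard example of a submodular function. The instance, the function names, and the tie-breaking rule are assumptions chosen for illustration and do not come from the cited papers.

```python
from typing import Dict, List, Set


def coverage(selected: List[str], sets: Dict[str, Set[int]]) -> int:
    """Number of distinct items covered by the selected sets (submodular)."""
    covered: Set[int] = set()
    for name in selected:
        covered |= sets[name]
    return len(covered)


def cost_greedy_sequence(sets: Dict[str, Set[int]],
                         costs: Dict[str, float]) -> List[str]:
    """Order elements by marginal coverage gain per unit cost, budget-agnostic."""
    sequence: List[str] = []
    remaining = set(sets)
    while remaining:
        current = coverage(sequence, sets)

        def ratio(name: str) -> float:
            return (coverage(sequence + [name], sets) - current) / costs[name]

        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(remaining), key=ratio)
        if ratio(best) <= 0:
            break
        sequence.append(best)
        remaining.remove(best)
    return sequence


def prefix_within_budget(sequence: List[str],
                         costs: Dict[str, float],
                         budget: float) -> List[str]:
    """Longest prefix of the sequence whose total cost fits the budget."""
    chosen: List[str] = []
    spent = 0.0
    for name in sequence:
        if spent + costs[name] > budget:
            break
        chosen.append(name)
        spent += costs[name]
    return chosen
```

For example, with a set `small` covering one item at cost 1 and a disjoint set `big` covering nine items at cost 10, the rule places `small` first because its ratio is 1.0 against 0.9. At a budget of 10 the prefix then contains only `small` and covers one item, whereas choosing `big` alone would cover nine, which is why the cardinality-constrained guarantee does not carry over unchanged.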

Finally, our work will build on previous work analyzing functions that are approximately
submodular. Das and Kempe [2011] give a multiplicative version of approximate submodularity
called the submodularity ratio. Similarly, Krause and Cevher [2010] give a version of
submodularity that includes an additive error term. This additive error term is similar to the additive error
terms  utilized  in  analyzing  online  submodular  maximization  approaches  [Krause  and  Guestrin, 
2005,  Streeter  and  Golovin,  2008,  Ross  et  al.,  2013].  Our  work  later  in  this  document  combines 
both  additive  and  multiplicative  relaxations  of  the  standard  submodularity  definition. 

Sparse  Approximation 

A common framework for controlling the complexity of a model is the sparse approximation
problem, also referred to as subset selection, sparse decomposition, and feature selection. In this setting
we are given a target signal or vector, and a set of basis vectors to use to reconstruct the target.
These basis vectors are often referred to as atoms, bases, or dictionary elements, and, when taken as
a whole, a dictionary or design matrix. The goal is to select a sparse set of these vectors that best
approximates the target, subject to some sparsity constraint. An equivalent formulation is to select
a sparse weight vector with which to combine the basis vectors.

We  will  later  use  the  sparse  approximation  framework  to  analyze  our  anytime  approach.  In 
general it can be shown that this problem is NP-hard [Natarajan, 1995], but a number of practical
approaches  with  approximation  and  regret  bounds  have  been  developed  in  previous  work. 

Most relevant to this work are the analyses of greedy feature selection algorithms
[Krause and Cevher, 2010, Das and Kempe, 2011] which build on submodular maximization
techniques. Krause and Cevher [2010] give an analysis of the dictionary selection problem, a
variant of the subset selection problem where the goal is to select a larger subset or dictionary of
which smaller subsets can be used to approximate a large set of different targets. Their analysis
relies on the incoherency of the dictionary elements, a geometric property which captures the
non-orthogonality of the dictionary. Das and Kempe [2011] give a similar analysis for the subset
selection and dictionary selection problems, but use spectral properties of the dictionary elements
which also capture the degree of orthogonality.

Greedy  approaches  to  solving  this  problem  include  Forward-Stepwise  Regression  or  simply 
Forward  Regression  [Miller,  2002],  Matching  Pursuit  [Mallat  and  Zhang,  1993]  also  known  as 
Forward-Stagewise  Regression,  and  Orthogonal  Matching  Pursuit  [Pati  et  al.,  1993].  Forward 
Regression greedily selects elements which maximally reduce reconstruction error when added
to the set of bases, while the matching pursuit approaches select elements based on their correlation
with the residual error remaining in the target at each iteration. Tropp [2004] gives a good analysis
of the Orthogonal Matching Pursuit algorithm which uses the same incoherency parameter used
by Krause and Cevher [2010] to show near-optimal reconstruction of the target.
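The correlation-based selection rule can be illustrated with a short sketch of basic Matching Pursuit. The example assumes unit-norm atoms stored as plain Python lists; the function name and the stopping tolerance are hypothetical choices. Orthogonal Matching Pursuit differs in that it refits the weights of all selected atoms by least squares after each selection, which is not shown here.

```python
from typing import List, Tuple


def matching_pursuit(atoms: List[List[float]],
                     target: List[float],
                     n_iters: int) -> Tuple[List[float], List[float]]:
    """Basic Matching Pursuit, assuming every atom has unit Euclidean norm.

    At each iteration the atom most correlated with the current residual is
    selected, and its contribution is subtracted from the residual.
    """
    residual = list(target)
    weights = [0.0] * len(atoms)
    for _ in range(n_iters):
        corrs = [sum(a_i * r_i for a_i, r_i in zip(atom, residual))
                 for atom in atoms]
        j = max(range(len(atoms)), key=lambda i: abs(corrs[i]))
        if abs(corrs[j]) < 1e-12:
            break  # residual is orthogonal to every atom
        weights[j] += corrs[j]
        residual = [r - corrs[j] * a for r, a in zip(residual, atoms[j])]
    return weights, residual
```

Forward Regression would instead evaluate, for each candidate atom, the reconstruction error after refitting with that atom added, which is more expensive per iteration.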

Another  popular  approach  to  the  sparse  approximation  problem  is  to  use  a  convex  relaxation  of 
the  sparsity  constraint  as  a  regularizer,  and  optimize  the  regularized  objective  directly.  Examples 
include the Lasso algorithm [Tibshirani, 1996], Basis Pursuit [Chen et al., 2001], and Least-Angle
Regression [Efron et al., 2004]. All of these algorithms optimize the L1 relaxation of the sparsity
constraint  using  different  methods. 

For the L1-based regularization approaches, there are two main focuses for proving the
algorithms are successful. One [Geer and Bühlmann, 2009] shows that the near-orthogonality of the
vectors being selected implies that the proper subset is selected with high probability. This
analysis relies on the RIP, or Restricted Isometry Property. The other approach [Juditsky and Nemirovski, 2000]
derives regret bounds with respect to a sparse, L1-bounded linear combination of the
variables, and shows that the magnitude of the sparse vector used for combination is the key factor for the
bound.

We will follow a tack similar to the weight-based analysis of Juditsky and Nemirovski [2000] in
our work here. The existing bounds discussed above for greedy sparse approximation approaches
all use geometric properties, similar to the RIP. In our work we would instead like to
focus on bounds derived using other properties which depend on the magnitude of the combining
weights being small, and not the underlying features being nearly orthogonal.

One final piece of literature related to our area of study is the work that
has been done on the simultaneous sparse approximation problem. This problem is similar to the
dictionary selection problem of Krause and Cevher [2010], in that we want to select a subset of
bases  that  reconstruct  a  set  of  multiple  target  vectors  well.  The  key  difference  between  these  two 
problems  is  that  dictionary  selection  allows  for  the  selection  of  a  larger  set  of  elements  than  is  used 
to  reconstruct  any  one  target,  while  the  simultaneous  sparse  approximation  problem  uses  the  same 
subset  for  every  single  target. 
There exist both greedy and regularization approaches to solving this problem. Simultaneous
Orthogonal Matching Pursuit [Cotter et al., 2005, Chen and Huo, 2006, Tropp et al., 2006] is a
greedy method for solving this problem, based on the single target OMP approach. In the
regularization or relaxation approaches, the corresponding relaxation of the sparsity constraint uses an
Lp-Lq mixed norm, typically an L1 norm of another, non-sparsity-inducing norm, such as the L2
or L∞ norm. The approaches for solving this problem are called Group Lasso [Meier et al., 2008,
Rakotomamonjy, 2011] algorithms, and select weight matrices that are sparse across features or
basis vectors, but dense across the target vectors, giving the desired sparse set of selected bases.
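For reference, one common way to write such a mixed norm is sketched below. The notation is an assumption made for illustration: W is a weight matrix with one row w_i for each of d basis vectors and one column per target, and this row and column convention is not taken from the original text.

```latex
\|W\|_{p,q} = \Big( \sum_{i=1}^{d} \|w_i\|_q^{\,p} \Big)^{1/p},
\qquad
\|W\|_{1,2} = \sum_{i=1}^{d} \|w_i\|_2 .
```

With p = 1, the outer sum encourages entire rows to be zero, so that a basis vector is dropped for all targets at once, while the inner L2 or L∞ norm does not itself induce sparsity within a row.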

Budgeted  Prediction 

Our primary focus in this work is on the trade-off between prediction cost and accuracy.
Particularly for functional gradient methods and related ensemble approaches, there have been a number
of  previous  approaches  that  attempt  to  tackle  the  prediction  cost  and  accuracy  trade-off. 

The focus in the budgeted prediction setting, also called budgeted learning, test-time
cost-sensitive learning, and resource-efficient machine learning, is to try to automatically make this
trade-off in ways that reduce the cost of achieving good predictions.

Similar to our work, a number of approaches have considered methods for improving the
prediction costs of functional gradient methods. Chen et al. [2012] and Xu et al. [2012] give
regularization based methods for augmenting the functional gradient approach to account for cost.
These methods focus on optimizing the feature computation time of a model, and attempt to select weak
learners which use cheap or already computed features. The first approach [Chen et al., 2012]
does this by optimizing the ordering and composition of a boosted ensemble after learning using
a traditional functional gradient approach. The second method [Xu et al., 2012] directly augments
the weak learner training procedure (specifically, regression trees) with a cost-aware regularizer.
This work has also been extended to include a variant which uses a branching, tree-based structure
[Xu et al., 2013b], and a variant suitable for anytime prediction [Xu et al., 2013a] following the
interest in this domain. This latter work uses a network of functional gradient modules, and
backpropagates a functional gradient through the network, in a manner similar to Grubb and Bagnell [2010].

Another approach for augmenting the functional gradient approach is the sampling-based
approach of Reyzin [2011], which uses randomized sampling at prediction time, weighted by cost, to
select which weak hypotheses to evaluate.

An early canonical approach for improving the prediction time performance of these additive
functional models is to use a cascade [Viola and Jones, 2001, 2004]. A cascade uses a sequence
of increasingly complex classifiers to sequentially select and eliminate examples for which the
predictor has high confidence in the current prediction, and then continues improving predictions
on the low confidence examples. The original formulation focuses on eliminating negative
examples, for settings where positive examples are very rare such as face detection, but extensions that
eliminate both classes [Sochman and Matas, 2005] exist.

Many other variations on building and optimizing cascades exist [Saberian and Vasconcelos,
2010, Brubaker et al., 2008], and Cambazoglu et al. [2010] give a version of functional gradient
methods which uses an early-exit strategy, similar to the cascade approach, which generates
predictions early if the model is confident enough in the current prediction. All these methods typically
target final performance of the learned predictor, however. Furthermore, due to the decision
making structure of these cascades and the permanent nature of prediction decisions, these models
must be very conservative in making early decisions and are unable to recover from early errors.
All of these factors combine to make cascades poor anytime predictors.
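
The permanence of early decisions can be illustrated with a small sketch. This is a hypothetical early-exit rule over an additive ensemble, not the method of any cited paper: the weak learners, the threshold, and the function names are all assumptions made for illustration.

```python
def early_exit_predict(weak_learners, x, threshold):
    """Sum weak learner outputs in order, stopping once |score| >= threshold."""
    score = 0.0
    evaluated = 0
    for h in weak_learners:
        score += h(x)
        evaluated += 1
        if abs(score) >= threshold:
            break  # the decision is final; later learners are never consulted
    label = 1 if score >= 0.0 else -1
    return label, evaluated


def full_predict(weak_learners, x):
    """Prediction of the full ensemble, with no early exit."""
    score = sum(h(x) for h in weak_learners)
    return 1 if score >= 0.0 else -1


# Hypothetical ensemble: two cheap learners followed by a strong correction.
learners = [lambda x: 0.5 * x, lambda x: 0.25 * x, lambda x: -2.0]

print(early_exit_predict(learners, 2.4, threshold=1.0))  # exits after one learner
print(full_predict(learners, 2.4))                       # the full ensemble disagrees
```

For the input 2.4 the early-exit rule commits to the positive class after a single weak learner, while the full ensemble predicts the negative class. Raising the threshold avoids this error, but only by making the rule more conservative and therefore slower.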

Another approach, orthogonal to the functional gradient based ones detailed so far, is to treat
the problem as a policy learning one. In this approach, we have states corresponding to which
predictions have been generated so far, and actions correspond to generating new predictions or
outputting final predictions. Examples of this include the value-of-information approach of Gao
and Koller [2011], the work on learning a predictor skipping policy of Busa-Fekete et al. [2012],
the dynamic predictor re-ordering policy of He et al. [2013], and the object recognition work of
Karayev et al. [2012]. In these approaches the policy for selecting which predictions to generate
and which features to use is typically generated by modeling the problem as a Markov Decision
Process and using some kind of reinforcement learning technique to learn a policy which selects
which weak hypotheses or features to compute next.

In the structured setting, Jiang et al. [2012] proposed a technique for reinforcement
learning that incorporates a user specified speed/accuracy trade-off distribution, and Weiss and Taskar
[2010] proposed a cascaded analog for structured prediction where the solution space is iteratively
refined/pruned over time. In contrast, our structured prediction work later in this document is
focused on learning a structured predictor with interruptible, anytime properties which is also trained
to balance both the structural and feature computation times during the inference procedure.
Recent work in computer vision and robotics [Sturgess et al., 2012, de Nijs et al., 2012] has similarly
investigated techniques for making approximate inference in graphical models more efficient via a
cascaded procedure that iteratively prunes subregions in the scene to analyze.

Previous approaches to the anytime prediction problem have focused on instance-based
learning algorithms, such as nearest neighbor classification [Ueno et al., 2006] and novelty detection
[Sofman et al., 2010]. These approaches use intelligent instance selection and ordering to achieve
rapid performance improvements on common cases, and then typically use the extra time for
searching through the 'long tail' of the data distribution and improving results for rare examples. In
the case of the latter, the training instances are even dynamically re-ordered based on the
distribution of the inputs to the prediction algorithm, further improving performance. As mentioned above,
a more recent anytime approach was given by Xu et al. [2013a], and uses a functional gradient
method similar to our initial work in this area [Grubb and Bagnell, 2012].


1.4  Contributions 


We now detail the structure of the rest of the document, and outline a number of important
contributions made in the various areas related to our anytime prediction approach.

•  In Chapter 2 we extend previous work on functional gradient methods and boosting with
a framework for analyzing arbitrary convex losses and arbitrary weak learners, as opposed
to the classifiers and single output regressors discussed previously. We also analyze the
convergence of functional gradient methods for smooth functions, extending previous results
and generalizing the notion of weak-to-strong learning to arbitrary weak learners. Finally,
we show that the widely used traditional functional gradient approaches fail to converge for
non-smooth objective functions, and give algorithms and convergence results that work in
the non-smooth setting.

•  In Chapter 3 we introduce two extensions to the functional gradient approach detailed in
Chapter 2. The first extends functional gradient methods from simple supervised approaches
to structured prediction problems using an additive, iterative decoding approach. The second
addresses overfitting issues that arise when previous predictions are used as inputs to later
predictors in the functional gradient setting, and adapts the method of stacking to this domain
to reduce this overfitting.

•  In Chapter 4 we detail our analysis of greedy algorithms for the budgeted monotone
submodular maximization problem, and derive approximation bounds that demonstrate the
near-optimal performance of these greedy approaches. We also extend previous work from the
literature with a characterization of approximately submodular functions, and analyze the
behavior of algorithms which are approximately greedy as well. Finally, we introduce a
modified greedy approach that can achieve good performance for any budget constraint without
knowing the budget a priori.

•  In Chapter 5 we analyze regularized variants of the sparse approximation problem, and show
that this problem is equivalent to the budgeted, approximately submodular setting detailed
in Chapter 4. Using these results we derive bounds that show that novel, budgeted or
time-aware versions of popular algorithms for this domain are near-optimal as well. In this
analysis, we also extend previous algorithms and results for the sparse approximation problem to
variants for arbitrary smooth losses and simultaneous targets.

•  In Chapter 6 we introduce our cost-greedy, functional gradient approach for solving the
anytime prediction problem. Building on the results in previous chapters, we also show that variants of
this anytime prediction approach are guaranteed to have near-optimal performance. Finally,
we demonstrate how to extend our anytime prediction approach to a number of applications.

•  In Chapter 7 we combine this anytime prediction approach (Chapter 6) with the structured
prediction extensions of functional gradient methods (Chapter 3) to obtain an anytime
structured prediction algorithm. We then demonstrate this algorithm on the scene understanding
domain.




Part I

Functional Gradient Methods

Chapter 2

Functional Gradient Methods


In this chapter we detail our framework for analyzing functional gradient methods, and present
convergence results for the general functional gradient approach. Using this new framework we
generalize the notion of weak-to-strong learning from the boosting domain to arbitrary weak
learners and loss functions. We also extend existing results that give weak-to-strong convergence for
smooth losses, and show that for non-smooth losses the widely used standard approach fails, both
theoretically and experimentally. To counter this, we develop new algorithms and accompanying
convergence results for the non-smooth setting.


2.1  Background 

In the functional gradient setting we want to learn a prediction function $f$ which minimizes some
objective functional $\mathcal{R}$:

\[ \min_f \mathcal{R}[f]. \tag{2.1} \]

We will also assume that $f$ is a linear combination of simpler functions $h \in \mathcal{H}$

\[ f(x) = \sum_t \alpha_t h_t(x), \tag{2.2} \]

where $\alpha_t \in \mathbb{R}$. In the boosting literature, these functions $h \in \mathcal{H}$ are typically referred to as weak
predictors or weak classifiers and are some set of functions generated by another learning algorithm
which we can easily optimize, known as a weak learner. By generating a linear combination of
these simpler functions, we hope to obtain better overall performance than any single one of the
weak learners $h$ could obtain.

We will now discuss the specific properties of the function space that the functions $f$ and $h$
are drawn from that we will utilize for our analysis, along with various properties of the objective
functional $\mathcal{R}$.

Previous work [Mason et al., 1999, Friedman, 2000] has presented the theory underlying
function space gradient descent in a variety of ways, but never in a form which is convenient for
convergence analysis. Recently, Ratliff [2009] proposed the $L^2$ function space as a natural
match for this setting. This representation as a vector space is particularly convenient as it
dovetails nicely with the analysis of gradient descent based algorithms. We will present here the Hilbert
space of functions most relevant to functional gradient boosting.

2.1.1  $L^2$ Function Space

Given a measurable input set $\mathcal{X}$, a complete vector space $\mathcal{V}$ of outputs, and measure $\mu$ over $\mathcal{X}$, the
function space $L^2(\mathcal{X}, \mathcal{V}, \mu)$ is the set of all equivalence classes of functions $f : \mathcal{X} \to \mathcal{V}$ such that
the Lebesgue integral

\[ \int_{\mathcal{X}} \|f(x)\|_{\mathcal{V}}^2 \, d\mu \tag{2.3} \]

is finite. In the special case where $\mu$ is a probability measure $P$ with density function $p(x)$,
Equation (2.3) is simply equivalent to $\mathbb{E}_P[\|f(x)\|_{\mathcal{V}}^2]$.

This Hilbert space has a natural inner product and norm:

\[ \begin{aligned}
\langle f, g \rangle_\mu &= \int_{\mathcal{X}} \langle f(x), g(x) \rangle_{\mathcal{V}} \, d\mu \\
\|f\|_\mu^2 &= \langle f, f \rangle_\mu \\
&= \int_{\mathcal{X}} \|f(x)\|_{\mathcal{V}}^2 \, d\mu,
\end{aligned} \]

which simplifies as one would expect for the probability measure case:

\[ \begin{aligned}
\langle f, g \rangle_P &= \mathbb{E}_P[\langle f(x), g(x) \rangle_{\mathcal{V}}] \\
\|f\|_P^2 &= \mathbb{E}_P[\|f(x)\|_{\mathcal{V}}^2].
\end{aligned} \]

We parameterize these operations by $\mu$ to denote their reliance on the underlying measure. For
a given input space $\mathcal{X}$ and output space $\mathcal{V}$, different underlying measures can produce a number of
spaces over functions $f : \mathcal{X} \to \mathcal{V}$. The underlying measure will also change the elements of the
space, which are the equivalence classes for the relation $\sim$:

\[ f \sim g \iff \|f - g\|_\mu^2 = 0. \]

The elements of $L^2(\mathcal{X}, \mathcal{V}, \mu)$ are required to be equivalence classes to ensure that the space is a
vector space.

In practice, we will often work with the empirical probability distribution $\hat{P}$ for an observed
set of points $\{x_n\}_{n=1}^N$. This just causes the vector space operations above to reduce to the empirical
expected values. The inner product becomes

\[ \langle f, g \rangle_{\hat{P}} = \frac{1}{N} \sum_{n=1}^{N} \langle f(x_n), g(x_n) \rangle_{\mathcal{V}}, \]

and the norm is correspondingly

\[ \|f\|_{\hat{P}}^2 = \frac{1}{N} \sum_{n=1}^{N} \|f(x_n)\|_{\mathcal{V}}^2. \]

For the sake of brevity, we will omit the measure $\mu$ unless otherwise necessary, as most
statements will hold for all measures. When talking about practical implementation, the measure used
is assumed to be the empirical probability $\hat{P}$.
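
The empirical operations above can be sketched in a few lines. This is a minimal illustration, assuming each function is represented only by its list of output vectors at the $N$ observed points; the names `inner` and `norm` and the example values are hypothetical.

```python
import math


def inner(f_vals, g_vals):
    """Empirical inner product: the mean over points of <f(x_n), g(x_n)>_V."""
    n = len(f_vals)
    return sum(
        sum(a * b for a, b in zip(fx, gx)) for fx, gx in zip(f_vals, g_vals)
    ) / n


def norm(f_vals):
    """Empirical norm induced by the inner product above."""
    return math.sqrt(inner(f_vals, f_vals))


# Outputs of two hypothetical functions at N = 2 points, with V = R^2.
f_vals = [[1.0, 0.0], [0.0, 2.0]]
g_vals = [[1.0, 1.0], [1.0, 1.0]]

print(inner(f_vals, g_vals))  # (1 + 2) / 2
print(norm(f_vals) ** 2)      # (1 + 4) / 2
```

Under this representation two functions that agree on every observed point are indistinguishable, which is the equivalence class structure described above for the measure $\hat{P}$.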

2.1.2  Functionals and Convexity

Consider a function space $\mathcal{F} = L^2(\mathcal{X}, \mathcal{V}, \mu)$. We will be analyzing the behavior of functionals
$\mathcal{R} : \mathcal{F} \to \mathbb{R}$ over these spaces. A typical example of a functional is the point-wise loss:

\[ \mathcal{R}_P[f] = \mathbb{E}_P[\ell(f(x))]. \tag{2.4} \]

To analyze the convergence of functional gradient algorithms across these functionals, we need
to rely on a few assumptions. A functional $\mathcal{R}[f]$ is convex if for all $f, g \in \mathcal{F}$ there exists a function
$\nabla\mathcal{R}[f]$ such that

\[ \mathcal{R}[g] \ge \mathcal{R}[f] + \langle \nabla\mathcal{R}[f], g - f \rangle. \tag{2.5} \]

We say that $\nabla\mathcal{R}[f]$ is a subgradient of the functional $\mathcal{R}$ at $f$. We will write the set of all
subgradients, or the subdifferential, of a functional $\mathcal{R}$ at function $f$ as

\[ \partial\mathcal{R}[f] = \{ \nabla\mathcal{R}[f] \mid \nabla\mathcal{R}[f] \text{ satisfies Equation (2.5) } \forall g \}. \tag{2.6} \]

As an example subgradient, consider the pointwise risk functional given in Equation (2.4). The
corresponding subgradient over $L^2(\mathcal{X}, \mathcal{V}, P)$ is

\[ \partial\mathcal{R}_P[f] = \{ \nabla \mid \nabla(x) \in \partial\ell(f(x)) \ \forall x \in \operatorname{supp}(P) \}, \tag{2.7} \]

where $\partial\ell(f(x))$ is the set of subgradients of the pointwise loss $\ell$ with respect to the output $f(x)$.
For differentiable $\ell$, this is just the partial derivative of $\ell$ with respect to input $f(x)$. Additionally,
$\operatorname{supp}(P)$ is the support of measure $P$, that is, the subset of $\mathcal{X}$ such that every open neighborhood of
every element $x \in \operatorname{supp}(P)$ has positive measure. This is only necessary to formalize the fact that
the subgradient function $\nabla$ need only be defined over elements with positive measure.

To verify this fact, observe that the definition of the subdifferential $\partial\ell(f(x))$ implies that

\[ \ell(g(x)) \ge \ell(f(x)) + \langle \nabla(x), g(x) - f(x) \rangle_{\mathcal{V}}, \]

for all $x$ with positive measure. Integrating over $\mathcal{X}$ gives

\[ \int_{\mathcal{X}} \ell(g(x)) p(x) \, dx \ge \int_{\mathcal{X}} \ell(f(x)) p(x) \, dx + \int_{\mathcal{X}} \langle \nabla(x), g(x) - f(x) \rangle_{\mathcal{V}} \, p(x) \, dx, \]

which is exactly the requirement for a subgradient

\[ \mathcal{R}_P[g] \ge \mathcal{R}_P[f] + \langle \nabla\mathcal{R}[f], g - f \rangle_P. \]

As a special case, we find that the subgradient of the empirical risk $\mathcal{R}_{\hat{P}}[f]$ is simply

\[ \partial\mathcal{R}_{\hat{P}}[f] = \{ \nabla \mid \nabla(x_n) \in \partial\ell(f(x_n)), \ n = 1, \ldots, N \}. \]

This function, defined only over the training points $x_n$, is simply the gradient of the loss $\ell$ for that
point, with respect to the current output $f(x_n)$ at that point.
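
The empirical subgradient can be computed point by point, as in the following sketch. It assumes scalar outputs and a pointwise loss that also depends on a per-example target or label, which the notation $\ell(f(x))$ leaves implicit; the two example losses and all function names are illustrative choices, not taken from the text.

```python
def squared_loss_grad(output, target):
    """Derivative of 0.5 * (output - target)^2 with respect to the output."""
    return output - target


def hinge_subgrad(output, label):
    """One valid subgradient of max(0, 1 - label * output) w.r.t. the output."""
    return -label if label * output < 1.0 else 0.0


def empirical_subgradient(f, xs, ys, pointwise_grad):
    """Values of a subgradient function at the training points x_1, ..., x_N."""
    return [pointwise_grad(f(x), y) for x, y in zip(xs, ys)]


# Hypothetical current predictor and training data.
f = lambda x: 2.0 * x
xs = [0.0, 1.0, 2.0]

print(empirical_subgradient(f, xs, [1.0, 1.0, 1.0], squared_loss_grad))
print(empirical_subgradient(f, xs, [1.0, -1.0, 1.0], hinge_subgrad))
```

The result is a list of $N$ values and says nothing about inputs outside the training set, which is why a projection onto a set of weak learners is needed before the gradient can be used for prediction.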

These subgradients are only valid for the $L^2$ space corresponding to the particular probability
distribution $P$. In fact, the functional gradient of a risk functional evaluated over a measure $P$ will
not always have a defined subgradient in spaces defined using another measure $P'$. For example,
no subgradient for the expected loss $\mathcal{R}_P[f]$ exists in the space derived from $\hat{P}$. Similarly, no
subgradient of the empirical loss $\mathcal{R}_{\hat{P}}[f]$ exists in the $L^2$ space derived from the true probability
distribution $P$.

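In addition to the simpler notion of convexity, we say that a functional $\mathcal{R}$ is $m$-strongly convex
if for all $f, g \in \mathcal{F}$:

\[ \mathcal{R}[g] \ge \mathcal{R}[f] + \langle \nabla\mathcal{R}[f], g - f \rangle + \frac{m}{2} \|g - f\|^2 \tag{2.8} \]

for some $m > 0$, and $M$-strongly smooth if

\[ \mathcal{R}[g] \le \mathcal{R}[f] + \langle \nabla\mathcal{R}[f], g - f \rangle + \frac{M}{2} \|g - f\|^2 \tag{2.9} \]

for some $M > 0$.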
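
As a worked example, consider the squared loss against a fixed target function. The target $y \in \mathcal{F}$ is introduced here purely for illustration and does not appear in the definitions above. Expanding the norm shows that both inequalities hold with equality:

```latex
\begin{align*}
\mathcal{R}[f] &= \tfrac{1}{2}\,\|f - y\|^2, \qquad \nabla\mathcal{R}[f] = f - y, \\
\mathcal{R}[g] &= \tfrac{1}{2}\,\|(f - y) + (g - f)\|^2 \\
  &= \tfrac{1}{2}\,\|f - y\|^2 + \langle f - y,\; g - f \rangle + \tfrac{1}{2}\,\|g - f\|^2 \\
  &= \mathcal{R}[f] + \langle \nabla\mathcal{R}[f],\; g - f \rangle + \tfrac{1}{2}\,\|g - f\|^2 .
\end{align*}
```

Under this assumption the squared loss is therefore both $m$-strongly convex and $M$-strongly smooth with $m = M = 1$.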


2.2  Functional  Gradient  Descent 

We now outline the functional gradient-based view of boosting [Mason et al., 1999, Friedman,
2000] and how it relates to other views of boosting. In contrast to the standard gradient descent
algorithm, the functional gradient formulation of boosting contains one extra step, where the
gradient is not followed directly, but is instead replaced by another vector or function, drawn from the
pool of weak predictors $\mathcal{H}$.

From a practical standpoint, a projection step is necessary when optimizing over function space
because the functions representing the gradient directly are computationally difficult to manipulate
and do not generalize to new inputs well. In terms of the connection to boosting, the allowable
search directions $\mathcal{H}$ correspond directly to the set of hypotheses generated by a weak learner.

The functional gradient descent algorithm is given in Algorithm 2.1. Our work in this area
addresses two key questions that arise from this view of boosting. First: what are appropriate ways
to implement the projection operation? Second: how do we quantify the performance of a given
set of weak learners, in general, and how does this performance affect the final performance of the
learned function $f_t$? Conveniently, the function space formalization detailed above gives simple
geometric answers to these concerns.


Algorithm 2.1  Functional Gradient Descent

Given: starting point $f_0$, objective $\mathcal{R}$
for $t = 1, \ldots, T$ do
    Compute a subgradient $\nabla_t \in \partial\mathcal{R}[f_{t-1}]$.
    Project $\nabla_t$ onto hypothesis space $\mathcal{H}$: $h^* = \mathrm{Proj}(\nabla_t, \mathcal{H})$.
    Select a step size $\alpha_t$.
    Update $f$: $f_t = f_{t-1} + \alpha_t h^*$.
end for
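
The loop in Algorithm 2.1 can be sketched as follows for one concrete setting. The assumptions here are not from the text: the objective is the empirical squared loss $\frac{1}{2N}\sum_n (f(x_n) - y_n)^2$, the hypothesis set is a small finite list of functions containing at least one non-zero function, the projection picks the candidate with the largest normalized inner product with the gradient, and the step size is the exact line-search step for this loss (which is negative when $h^*$ is aligned with the gradient, so the update still descends).

```python
import math


def inner(u, v):
    """Empirical inner product of two functions given by their training outputs."""
    return sum(a * b for a, b in zip(u, v)) / len(u)


def risk(f_vals, ys):
    """Empirical squared-loss risk, 0.5 * mean((f(x_n) - y_n)^2)."""
    return 0.5 * sum((f - y) ** 2 for f, y in zip(f_vals, ys)) / len(ys)


def functional_gradient_descent(xs, ys, hypotheses, num_steps):
    """Sketch of Algorithm 2.1 for squared loss over a finite hypothesis list."""
    h_vals = [[h(x) for x in xs] for h in hypotheses]
    f_vals = [0.0] * len(xs)  # f_0 = 0 on the training points
    ensemble = []             # list of (alpha_t, h_t) pairs
    for _ in range(num_steps):
        # Gradient of the squared loss at the current outputs.
        grad = [f - y for f, y in zip(f_vals, ys)]
        # Projection: maximize <grad, h> / ||h|| over the candidates.
        best, best_score = None, -math.inf
        for i, hv in enumerate(h_vals):
            h_norm = math.sqrt(inner(hv, hv))
            if h_norm == 0.0:
                continue
            score = inner(grad, hv) / h_norm
            if score > best_score:
                best, best_score = i, score
        hv = h_vals[best]
        # Exact line search for squared loss along the chosen direction.
        alpha = -inner(grad, hv) / inner(hv, hv)
        f_vals = [f + alpha * h for f, h in zip(f_vals, hv)]
        ensemble.append((alpha, hypotheses[best]))
    return ensemble, f_vals


# Hypothetical data from y = 2x + 1 and a tiny hypothesis set.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
hypotheses = [lambda x: 1.0, lambda x: x, lambda x: -1.0, lambda x: -x]

ensemble, f_vals = functional_gradient_descent(xs, ys, hypotheses, 100)
prediction = sum(alpha * h(4.0) for alpha, h in ensemble)
print(risk(f_vals, ys), prediction)
```

Unlike the raw gradient, the returned ensemble is a sum of functions and can be evaluated at inputs that were never seen during training, as the final prediction at $x = 4$ shows.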


Figure 2.1: Geometric intuition underlying (a) the basic gradient projection operation and (b-c)
the two methods for optimizing this projection operation over a set of functions $\mathcal{H}$. The
inner product formulation (b) minimizes the effective angle between the gradient $\nabla$ and $h$, while the norm
formulation (c) minimizes the effective distance between the two in function space. Panels: (a) Basic
Gradient Projection; (b) Projection via Maximum Inner Product; (c) Projection via Minimum Distance.


For a given (sub)gradient $\nabla$ and candidate weak learner $h$, the closest point $h'$ along $h$ can be
found using vector projection:

\[ h' = \frac{\langle \nabla, h \rangle}{\|h\|^2} \, h. \tag{2.10} \]

Now, given a set of weak learners $\mathcal{H}$ the vector $h^*$ which minimizes the error of the projection
in Equation (2.10) also maximizes the projected length:

\[ h^* = \mathrm{Proj}(\nabla, \mathcal{H}) = \operatorname*{arg\,max}_{h \in \mathcal{H}} \frac{\langle \nabla, h \rangle}{\|h\|}. \tag{2.11} \]

This is a generalization of the projection operation in Mason et al. [1999] to functions other than
classifiers.

For the special case when $\mathcal{H}$ is closed under scalar multiplication, one can instead find $h^*$ by
directly minimizing the distance between $\nabla$ and $h^*$,

\[ h^* = \mathrm{Proj}(\nabla, \mathcal{H}) = \operatorname*{arg\,min}_{h \in \mathcal{H}} \|\nabla - h\|^2, \tag{2.12} \]

thereby reducing the final projected distance found using Equation (2.10). This projection
operation is equivalent to the one given by Friedman [2000], and is suitable for function classes $\mathcal{H}$ that
behave like regressors.
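
The two projections can disagree when the candidate set is not closed under scalar multiplication. The sketch below illustrates this with a hypothetical gradient and two hypothetical candidates, each represented by its outputs at two training points; the function names are assumptions made for the example.

```python
import math


def inner(u, v):
    """Empirical inner product of two functions given by their training outputs."""
    return sum(a * b for a, b in zip(u, v)) / len(u)


def sq_dist(u, v):
    """Squared empirical distance between two functions."""
    return inner([a - b for a, b in zip(u, v)], [a - b for a, b in zip(u, v)])


def project_max_inner_product(grad, candidates):
    """Equation (2.11): index of the candidate maximizing <grad, h> / ||h||."""
    return max(
        range(len(candidates)),
        key=lambda i: inner(grad, candidates[i])
        / math.sqrt(inner(candidates[i], candidates[i])),
    )


def project_min_distance(grad, candidates):
    """Equation (2.12): index of the candidate minimizing ||grad - h||^2."""
    return min(range(len(candidates)), key=lambda i: sq_dist(grad, candidates[i]))


def closest_point_along(grad, h):
    """Equation (2.10): the rescaling of h that is closest to grad."""
    scale = inner(grad, h) / inner(h, h)
    return [scale * v for v in h]


grad = [1.0, 1.0]
candidates = [[10.0, 9.0], [1.0, 0.0]]  # well aligned but badly scaled vs. nearby

print(project_max_inner_product(grad, candidates))  # prefers the aligned candidate
print(project_min_distance(grad, candidates))       # prefers the nearby candidate
print(closest_point_along(grad, candidates[0]))     # rescaled aligned candidate
```

Once the first candidate is rescaled using Equation (2.10) it is far closer to the gradient than the second, which is why the distance-based projection is only appropriate when the set already contains such rescalings.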

A depiction of the geometric intuition behind these projection operations is given in Figure 2.1.
These two projection methods provide relatively simple ways to search over any set of allowable
directions for the 'best' descent direction. We can also use these same geometric notions to
quantify the performance of any given set of weak learners. This guarantee on the performance of each
projection step, typically referred to in the traditional boosting literature as the edge of a given
weak learner set, is crucial to our convergence analysis of functional gradient algorithms.

For the projection which maximizes the inner product as in Equation (2.11), we can use the
generalized geometric notion of angle to bound performance by requiring that

\[ \langle \nabla, h^* \rangle \ge (\cos\theta) \, \|\nabla\| \, \|h^*\|, \]

while the equivalent requirement for the norm-based projection in (2.12) is

\[ \|\nabla - h^*\|^2 \le (1 - (\cos\theta)^2) \, \|\nabla\|^2. \]

It can be seen that this requirement implies the first requirement for arbitrary sets $\mathcal{H}$. In the special
case when $\mathcal{H}$ is closed under scalar multiplication, these two requirements are equivalent.

Parameterizing by $\cos\theta$, we can now concisely define the performance potential of a set of
weak learners, which will prove useful in later analysis.

Definition 2.2.1. A set $\mathcal{H}$ has edge $\gamma$ for a given projected gradient $\nabla$ if there exists a vector
$h^* \in \mathcal{H}$ such that either $\langle \nabla, h^* \rangle \ge \gamma \|\nabla\| \|h^*\|$ or $\|\nabla - h^*\|^2 \le (1 - \gamma^2) \|\nabla\|^2$.


This definition of edge is parameterized by $\gamma \in [0, 1]$, with larger values of edge corresponding
to lower projection error and faster algorithm convergence. Historically the edge corresponds to
an increase in performance over some baseline. For instance, in traditional classification problems,
the edge corresponds to the edge in performance over random guessing. In our framework, the
baseline performer can be thought of as the predictor $h(x) = 0$. The definition of edge given above
smoothly interpolates between having no edge over the zero predictor ($\gamma = 0$) and having perfect
reconstruction of the projected gradient ($\gamma = 1$).
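
For a finite candidate set, the largest edge permitted by the inner product condition of Definition 2.2.1 is simply the best cosine between the gradient and a candidate. The sketch below computes it, assuming a non-zero gradient and functions represented by their training outputs; the function names and example vectors are hypothetical.

```python
import math


def inner(u, v):
    """Empirical inner product of two functions given by their training outputs."""
    return sum(a * b for a, b in zip(u, v)) / len(u)


def edge(grad, candidates):
    """Largest gamma in [0, 1] with <grad, h> >= gamma * ||grad|| * ||h|| for some h."""
    grad_norm = math.sqrt(inner(grad, grad))
    best = 0.0  # the zero predictor baseline
    for h in candidates:
        h_norm = math.sqrt(inner(h, h))
        if h_norm > 0.0:
            best = max(best, inner(grad, h) / (grad_norm * h_norm))
    return best


grad = [3.0, 4.0]
candidates = [[1.0, 0.0], [0.0, 1.0]]
gamma = edge(grad, candidates)
print(gamma)

# The norm-based condition holds for the optimally rescaled best candidate.
h = candidates[1]
scale = inner(grad, h) / inner(h, h)
residual = [g - scale * v for g, v in zip(grad, h)]
print(inner(residual, residual), (1.0 - gamma ** 2) * inner(grad, grad))
```

A candidate equal to the gradient itself gives $\gamma = 1$, while a set whose members all point away from the gradient gives $\gamma = 0$, matching the two endpoints described above.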

2.2.1  Relationship  to  Previous  Boosting  Work 

Though these projection operations apply to any Hilbert space and set $\mathcal{H}$, they also have convenient
interpretations when it comes to specific function classes traditionally used as weak learners in
boosting.

For a classification-based weak learner with outputs in $\{-1, +1\}$ and an optimization over
single output functions $f : \mathcal{X} \to \mathbb{R}$, projecting as in Equation (2.11) is equivalent to solving the
weighted classification problem

\[ \operatorname*{arg\,max}_{h \in \mathcal{H}} \frac{1}{N} \sum_{n=1}^{N} |\nabla(x_n)| \, \mathbf{1}\big(h(x_n) = \operatorname{sgn}(\nabla(x_n))\big), \tag{2.13} \]

over the training examples $x_n$, with labels $\operatorname{sgn}(\nabla(x_n))$ and weights $|\nabla(x_n)|$.

For arbitrary real-valued outputs, the projection via norm minimization in Equation (2.12) is
equivalent to solving the regression problem

\[ \operatorname*{arg\,min}_{h \in \mathcal{H}} \frac{1}{N} \sum_{n=1}^{N} \|\nabla(x_n) - h(x_n)\|^2 \]

again over the training examples $x_n$ with regression targets $\nabla(x_n)$.
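
The equivalence for classifiers follows from a simple identity: since every classifier with outputs in $\{-1, +1\}$ has unit norm, $\langle \nabla, h \rangle$ equals twice the weighted accuracy of Equation (2.13) minus the constant $\frac{1}{N}\sum_n |\nabla(x_n)|$. The sketch below checks this numerically on a hypothetical gradient and three hypothetical classifiers, each given by its outputs on four training points.

```python
def inner_product_objective(grad, h_out):
    """<grad, h> for a classifier h given by its +/-1 outputs on the training set."""
    return sum(g * h for g, h in zip(grad, h_out)) / len(grad)


def weighted_accuracy(grad, h_out):
    """Objective of Equation (2.13): weights |grad(x_n)|, labels sgn(grad(x_n))."""
    return sum(
        abs(g) for g, h in zip(grad, h_out) if h == (1 if g >= 0 else -1)
    ) / len(grad)


grad = [0.5, -2.0, 1.5, -0.25]
classifiers = [[1, -1, -1, -1], [1, 1, 1, 1], [-1, -1, 1, 1]]
total_weight = sum(abs(g) for g in grad) / len(grad)

for h_out in classifiers:
    lhs = inner_product_objective(grad, h_out)
    rhs = 2.0 * weighted_accuracy(grad, h_out) - total_weight
    print(lhs, rhs)

best_by_inner = max(range(3), key=lambda i: inner_product_objective(grad, classifiers[i]))
best_by_accuracy = max(range(3), key=lambda i: weighted_accuracy(grad, classifiers[i]))
print(best_by_inner, best_by_accuracy)
```

Because the two objectives differ only by a positive scaling and a constant, they always select the same classifier.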

Similarly, our notion of weak learner performance in Definition 2.2.1 can be related to previous
work. Like our measure of edge, which quantifies performance over the trivial hypothesis $h(x) = 0$,
previous work has used similar quantities which capture the advantage over baseline hypotheses.

For weak learners which are binary classifiers, as is the case in AdaBoost [Freund and Schapire,
1997], there is an equivalent notion of edge which refers to the improvement in performance over
predicting randomly in the weighted classification problem given above. We can show that
Definition 2.2.1 is an equivalent measure.

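Theorem 2.2.2. For a weak classifier set $\mathcal{H}$ with outputs in $\{-1, +1\}$ and some gradient $\nabla$,
the following statements are equivalent: (1) $\mathcal{H}$ has edge $\gamma$ for some $\gamma > 0$, and (2) for any
non-negative weights $w_n$ over training data $x_n$, there is a classifier $h \in \mathcal{H}$ which achieves an
error of at most $\left(\frac{1}{2} - \frac{\delta}{2}\right) \sum_n w_n$ on the weighted classification problem given in Equation (2.13),
for some $\delta > 0$.

Proof. To relate the weighted classification setting and our inner product formulation, let weights
$w_n = |\nabla(x_n)|$ and labels $y_n = \operatorname{sgn}(\nabla(x_n))$. We examine classifiers $h$ with outputs in $\{-1, +1\}$.

Consider the AdaBoost weak learner requirement re-written as a sum over the correct
examples:

\[ \sum_{n, h(x_n) = y_n} w_n \ge \left(\frac{1}{2} + \frac{\delta}{2}\right) \sum_n w_n. \]

Breaking the sum over weights into the sum of correct and incorrect weights:

\[ \frac{1}{2} \Big( \sum_{n, h(x_n) = y_n} w_n - \sum_{n, h(x_n) \ne y_n} w_n \Big) \ge \frac{\delta}{2} \sum_n w_n. \]

After multiplying through by 2, the left hand side of this inequality is just $N$ times the inner
product $\langle \nabla, h \rangle$, and the right hand side can be re-written using the 1-norm of the weight vector $w$, giving:

\[ \begin{aligned}
N \langle \nabla, h \rangle &\ge \delta \|w\|_1 \\
&\ge \delta \|w\|_2.
\end{aligned} \]

Finally, using $\|h\| = 1$ and $\|\nabla\|^2 = \frac{1}{N} \|w\|_2^2$:

\[ \langle \nabla, h \rangle \ge \frac{\delta}{\sqrt{N}} \|\nabla\| \|h\|, \]

showing that the AdaBoost requirement implies our requirement for edge $\gamma \ge \frac{\delta}{\sqrt{N}} > 0$.

We can show the converse by starting with our weak learner requirement and expanding:

\[ \begin{aligned}
\langle \nabla, h \rangle &\ge \gamma \|\nabla\| \|h\| \\
\frac{1}{N} \Big( \sum_{n, h(x_n) = y_n} w_n - \sum_{n, h(x_n) \ne y_n} w_n \Big) &\ge \gamma \|\nabla\|.
\end{aligned} \]

Then, because $\|\nabla\|^2 = \frac{1}{N} \|w\|_2^2$ and $\|w\|_2 \ge \frac{1}{\sqrt{N}} \|w\|_1$ we get:

\[ \begin{aligned}
\sum_{n, h(x_n) = y_n} w_n - \sum_{n, h(x_n) \ne y_n} w_n &\ge \gamma \|w\|_1 \\
&\ge \gamma \sum_n w_n \\
\sum_{n, h(x_n) = y_n} w_n &\ge \left(\frac{1}{2} + \frac{\gamma}{2}\right) \sum_n w_n,
\end{aligned} \]

giving the final AdaBoost edge requirement.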


In the first part of the previous proof, the scaling of $\frac{1}{\sqrt{N}}$ shows that our implied edge weakens
as the number of data points increases in relation to the AdaBoost style edge requirement, an
unfortunate but necessary feature. This weakening is necessary because our notion of strong learning
is much more general than the previous definitions tailored directly to classification problems and
specific loss functions. In those settings, strong learning only guarantees that any dataset can be
classified with 0 training error, while our strong learning guarantee gives optimal performance on
any convex loss function.
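
The relationship between the two notions of edge can be checked numerically. The sketch below is an illustration under stated assumptions rather than part of the proof: for a single fixed classifier it computes the AdaBoost style quantity $\delta$ (from the weighted fraction of correct examples) and the geometric quantity $\gamma$ (the cosine between $\nabla$ and $h$), and verifies on randomly generated weights that $|\delta| / \sqrt{N} \le |\gamma| \le |\delta|$. All function names and the random test data are hypothetical.

```python
import math
import random


def adaboost_edge(weights, labels, h_out):
    """delta such that the correct weight equals (1/2 + delta/2) * total weight."""
    total = sum(weights)
    correct = sum(w for w, y, h in zip(weights, labels, h_out) if y == h)
    return 2.0 * correct / total - 1.0


def geometric_edge(weights, labels, h_out):
    """gamma = <grad, h> / (||grad|| ||h||) with grad(x_n) = w_n * y_n, ||h|| = 1."""
    n = len(weights)
    inner = sum(w * y * h for w, y, h in zip(weights, labels, h_out)) / n
    grad_norm = math.sqrt(sum(w * w for w in weights) / n)
    return inner / grad_norm


def bounds_hold(num_trials=1000, seed=0):
    """Check |delta| / sqrt(N) <= |gamma| <= |delta| on random problems."""
    rng = random.Random(seed)
    for _ in range(num_trials):
        n = rng.randint(2, 20)
        weights = [rng.uniform(0.1, 1.0) for _ in range(n)]
        labels = [rng.choice([-1, 1]) for _ in range(n)]
        h_out = [rng.choice([-1, 1]) for _ in range(n)]
        delta = abs(adaboost_edge(weights, labels, h_out))
        gamma = abs(geometric_edge(weights, labels, h_out))
        if not (delta / math.sqrt(n) - 1e-12 <= gamma <= delta + 1e-12):
            return False
    return True


print(bounds_hold())
```

The lower bound is tight when all of the weight sits on a single example, which is the case responsible for the $\frac{1}{\sqrt{N}}$ weakening discussed above.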

A similar result can be shown for more recent work which generalizes AdaBoost to multiclass
classification using multiclass weak learners [Mukherjee and Schapire, 2010]. The notion of edge
here uses a cost-sensitive multiclass learning problem as the projection operation, and again the
edge is used to compare the performance of the weak learners to that of random guessing. For
more details we refer the reader to the work of Mukherjee and Schapire [2010].

In this setting the weak learners $h$ are multiclass classifiers over $K$ outputs, while the
comparable weak learners in our functional gradient setting are defined over multiple outputs, $h' : \mathcal{X} \to \mathbb{R}^K$.

Theorem 2.2.3. For a weak multiclass classifier set $\mathcal{H}$ with outputs in $\{1, \ldots, K\}$, let the
modified hypothesis space $\mathcal{H}'$ contain a hypothesis $h' : \mathcal{X} \to \mathbb{R}^K$ for each $h \in \mathcal{H}$ such that $h'(x)_k = 1$
if $h(x) = k$ and $h'(x)_k = -\frac{1}{K-1}$ otherwise. Then, for a given gradient function $\nabla$, the following
statements are equivalent: (1) $\mathcal{H}'$ has edge $\gamma$ for some $\gamma > 0$, and (2) $\mathcal{H}$ satisfies the
performance over baseline requirements detailed in Theorem 1 of [Mukherjee and Schapire, 2010].


^ Proof.  In  this  section  we  consider  the  multiclass  extension  of  the  previous  setting.  Instead  of 
a  weight  vector  we  now  have  a  matrix  of  weights  w  where  wnk  is  the  weight  or  reward  for 
classifying  example  xn  as  class  k.  We  can  simply  let  weights  wnk  =  V (xnk)  and  use  the 
same  weak  learning  approach  as  in  [Mukherjee  and  Schapire,  2010].  Given  classifiers  h(x) 
which  output  a  label  in  {1, . . . ,  K},  we  convert  to  an  appropriate  weak  learner  for  our  setting 
by  building  a  function  h'(x)  which  outputs  a  vector  y  e  1ZK  such  that  yk  —  1  if  h{x)  =  k  and 
ip,  =  —  j2—  otherwise. 

The  equivalent  AdaBoost  style  requirement  uses  costs  cnk  =  —wnk  and  minimizes  instead 
of  maximizing,  but  here  we  state  the  weight  or  reward  version  of  the  requirement.  More  details 
on  this  setting  can  be  found  in  [Mukherjee  and  Schapire,  2010].  We  also  make  the  additional 
assumption  that  ffk  wnk  =  0,  Vn  without  loss  of  generality.  This  assumption  is  fine  as  we  can 
take  a  given  weight  matrix  w  and  modify  each  row  so  it  has  0  mean,  and  still  have  a  valid  classi¬ 
fication  matrix  as  per  [Mukherjee  and  Schapire,  2010].  Furthermore,  this  modification  does  not 
affect  the  edge  over  random  performance  of  a  multiclass  classifier  under  their  framework. 

Again consider the multiclass AdaBoost weak learner requirement re-written as a sum of the
weights over the predicted class for each example:

$$\sum_n w_{nh(x_n)} \geq \left(\frac{1}{K} - \frac{\delta}{K}\right)\sum_{n,k} w_{nk} + \delta\sum_n w_{ny_n}$$

we can then convert the sum over correct labels to the max-norm on weights and multiply
through by $\frac{K}{K-1}$:

$$\sum_n w_{nh(x_n)} - \frac{1}{K}\sum_{n,k} w_{nk} \geq -\frac{\delta}{K}\sum_{n,k} w_{nk} + \delta\sum_n w_{ny_n}$$

$$\sum_n w_{nh(x_n)} - \frac{1}{K}\sum_{n,k} w_{nk} \geq \delta\left(\sum_n \|w_n\|_\infty - \frac{1}{K}\sum_{n,k} w_{nk}\right)$$

$$\frac{K}{K-1}\sum_n w_{nh(x_n)} - \frac{1}{K-1}\sum_{n,k} w_{nk} \geq \frac{K}{K-1}\,\delta\left(\sum_n \|w_n\|_\infty - \frac{1}{K}\sum_{n,k} w_{nk}\right)$$

by the fact that the correct label $y_n = \arg\max_k w_{nk}$.

The left hand side of this inequality is just the function space inner product:

$$N\langle\nabla, h'\rangle \geq \frac{K}{K-1}\,\delta\left(\sum_n \|w_n\|_\infty - \frac{1}{K}\sum_{n,k} w_{nk}\right).$$

Using the fact that $\sum_k w_{nk} = 0$ along with $\|\nabla\| \leq \frac{1}{\sqrt{N}}\sum_n \|w_n\|_2$ and $\|h'\| = \sqrt{\frac{K}{K-1}}$ we
can now bound the right hand side:

$$N\langle\nabla, h'\rangle \geq \frac{K}{K-1}\,\delta\sum_n \|w_n\|_\infty$$

$$\geq \frac{K}{K-1}\,\frac{\delta}{\sqrt{K}}\sum_n \|w_n\|_2$$

$$\geq \delta\sqrt{\frac{N}{K-1}}\,\|\nabla\|\|h'\|$$

For $K \geq 2$ we get $\gamma \geq \frac{\delta}{\sqrt{N(K-1)}}$, showing that the existence of the AdaBoost style edge implies
the existence of ours. Again, while the requirements are equivalent for some fixed dataset, we
see a weakening of the implication as the dataset grows large, an unfortunate consequence of
our broader strong learning goals.

Now  to  show  the  other  direction,  start  with  the  inner  product  formulation: 


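$$\langle\nabla, h'\rangle \geq \gamma\|\nabla\|\|h'\|$$

$$\frac{1}{N}\left(\sum_n w_{nh(x_n)} - \frac{1}{K-1}\sum_{n,k\neq h(x_n)} w_{nk}\right) \geq \gamma\|\nabla\|\|h'\|$$

$$\frac{1}{N}\left(\frac{K}{K-1}\sum_n w_{nh(x_n)} - \frac{1}{K-1}\sum_{n,k} w_{nk}\right) \geq \gamma\|\nabla\|\|h'\|$$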


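Using $\|h'\| = \sqrt{\frac{K}{K-1}}$ and $\|\nabla\| \geq \frac{1}{N}\sum_n \|w_n\|_2$ we can show:

$$\frac{K}{K-1}\sum_n w_{nh(x_n)} - \frac{1}{K-1}\sum_{n,k} w_{nk} \geq \gamma\sqrt{\frac{K}{K-1}}\sum_n \|w_n\|_2$$

Rearranging we get:

$$\frac{K}{K-1}\sum_n w_{nh(x_n)} \geq \frac{1}{K-1}\sum_{n,k} w_{nk} + \gamma\sqrt{\frac{K}{K-1}}\sum_n \|w_n\|_2$$

$$\sum_n w_{nh(x_n)} \geq \frac{1}{K}\sum_{n,k} w_{nk} + \gamma\,\frac{K-1}{K}\sqrt{\frac{K}{K-1}}\sum_n \|w_n\|_2$$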


WnhM  E  ^  Wnfe  +  V  k  __  1 


m 


E 


n,k  n  n 


\W. 


n  112/ 


Next, bound the 2-norms using $\|w_n\|_2 \geq \frac{1}{\sqrt{K}}\|w_n\|_1$ and $\|w_n\|_2 \geq \|w_n\|_\infty$ and then rewrite
as sums of corresponding weights to show the multiclass AdaBoost requirement holds:


S  Wnh{xn)  >  (  K  ,K  _  1R)  Wnk  +  \j  R  _  1  S  \\W”\\oo 
n  v  n,k  n 

$$\sum_n w_{nh(x_n)} \geq \left(\frac{1}{K} - \frac{\delta}{K}\right)\sum_{n,k} w_{nk} + \delta\sum_n w_{ny_n}$$

□


2.3 Restricted Gradient Descent

We will now focus on using the machinery developed above to analyze the behavior of variants of
the functional gradient boosting method on problems of the form:

$$\min_{f \in \mathcal{F}} \mathcal{R}[f]$$

where $f$ is a sum of weak learners taken from some set $\mathcal{H} \subset \mathcal{F}$.

In line with previous boosting work, we will specifically consider cases where the edge requirement
in Definition 2.2.1 is met for some $\gamma$ at every iteration, and seek convergence results where
the empirical risk $\mathcal{R}_{\hat{P}}[f_t]$ approaches the optimal training performance $\min_{f \in \mathcal{F}} \mathcal{R}_{\hat{P}}[f]$. For the rest
of the work it is assumed that function space operations and functionals are evaluated with respect
to the empirical distribution $\hat{P}$. This work does not attempt to analyze the convergence of the true
risk, or generalization error, $\mathcal{R}_P[f]$.


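Algorithm 2.2 Basic Gradient Projection Algorithm

Given: starting point $f_0$, objective $\mathcal{R}$, step size schedule $\{\eta_t\}_{t=1}^T$
for $t = 1, \ldots, T$ do
    Compute a subgradient $\nabla_t \in \partial\mathcal{R}[f_{t-1}]$.
    Compute $h^* = \mathrm{Proj}(\nabla_t, \mathcal{H})$.
    Update $f$: $f_t = f_{t-1} - \eta_t \frac{\langle h^*, \nabla_t\rangle}{\|h^*\|^2} h^*$.
end for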


In order to complete this analysis, we will consider a general version of the functional gradient
boosting procedure given in Algorithm 2.1 which we call restricted gradient descent. While
we will continue to use the notation of $L^2$ function spaces specifically, the convergence analysis
presented can be applied generally to any Hilbert space.

Let $\mathcal{F}$ be a Hilbert space and $\mathcal{H} \subset \mathcal{F}$ be a set of allowable search directions, or restriction set.
This set, which in traditional boosting is the set of weak learners, can also be thought of as a basis
for the subset of $\mathcal{F}$ that we are actually searching over.

In the restricted gradient descent setting we want to perform a gradient descent-like procedure,
while only ever taking steps along search directions drawn from $\mathcal{H}$. To do this, we will project
each gradient on to the set $\mathcal{H}$ as in functional gradient boosting. Algorithm 2.2 gives the basic
algorithm for projecting the gradients and taking steps. The main difference between this algorithm
and the previous functional gradient one is the extra term in the actual (projected) gradient step
which depends on $\langle\nabla, h^*\rangle$. Otherwise this algorithm is functionally the same as the functional
gradient boosting method.
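
The basic projection loop can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: functions are stored as plain lists of values, the inner product is an ordinary dot product (the analysis applies to any Hilbert space), the restriction set is a finite list, and all names (`dot`, `project`, `basic_gradient_projection`, the toy quadratic objective) are assumptions introduced here.

```python
import math


def dot(f, g):
    """Inner product between two functions stored as lists of values."""
    return sum(a * b for a, b in zip(f, g))


def project(grad, hypotheses):
    """Pick the hypothesis h maximizing <grad, h> / ||h||."""
    return max(hypotheses, key=lambda h: dot(grad, h) / math.sqrt(dot(h, h)))


def basic_gradient_projection(f0, subgradient, hypotheses, step_sizes):
    """Run the basic projection loop, one projected step per step size."""
    f = list(f0)
    for eta in step_sizes:
        grad = subgradient(f)
        h = project(grad, hypotheses)
        # Projected step: eta * <grad, h> / ||h||^2 along h.
        scale = eta * dot(grad, h) / dot(h, h)
        f = [fi - scale * hi for fi, hi in zip(f, h)]
    return f


# Toy smooth objective R[f] = 0.5 * sum_n (f_n - target_n)^2, so grad = f - target.
target = [1.0, -2.0, 3.0]


def quadratic_subgradient(f):
    return [fi - ti for fi, ti in zip(f, target)]


# Hypothetical restriction set: signed coordinate directions.
coordinate_directions = []
for i in range(len(target)):
    for sign in (1.0, -1.0):
        coordinate_directions.append(
            [sign if j == i else 0.0 for j in range(len(target))])

f_final = basic_gradient_projection(
    [0.0, 0.0, 0.0], quadratic_subgradient, coordinate_directions, [1.0, 1.0, 1.0])
print(f_final)  # → [1.0, -2.0, 3.0]
```

With unit step size each iteration fixes the coordinate with the largest gradient magnitude, so three steps reach the minimizer of this toy quadratic.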

Now, using the definition of edge given in Definition 2.2.1, we will first analyze the performance
of this restricted gradient descent algorithm on smooth functionals.


2.4 Smooth Convergence Results

With respect to the functional gradient boosting literature, an earlier result showing $O((1 - \frac{1}{C})^T)$
convergence of the objective to optimality for smooth functionals is given by Rätsch et al. [2002]
using results from the optimization literature on coordinate descent. Equivalently, this gives an
$O(\log(\frac{1}{\epsilon}))$ bound on the number of iterations required to achieve error $\epsilon$. Similar to our result, this
work relies on the smoothness of the objective as well as the weak learner performance, but uses
the more restrictive notion of edge from previous boosting literature specifically tailored to PAC
weak learners (classifiers). This previous result also has an additional dependence on the number
of weak learners and number of training examples.

We will now give a generalization of the result in Rätsch et al. [2002] which uses our more
general  definition  of  weak  learner  edge.  This  result  also  relates  to  the  previous  work  of  Mason 
et  al.  [1999].  In  that  work,  a  similar  convergence  analysis  is  given,  but  the  analysis  only  states  that, 
under  similar  conditions,  the  gradient  boosting  procedure  will  eventually  converge.  Our  analysis, 
however,  considers  the  speed  of  convergence  and  the  impact  that  our  definition  of  weak  learner 
edge  has  on  the  convergence. 

Recall  the  strong  smoothness  and  strong  convexity  properties  given  earlier  in  Equations  (2.9) 
and  (2.8).  Using  these  two  properties,  we  can  now  derive  a  convergence  result  for  unconstrained 
optimization  over  smooth  functions. 

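Theorem 2.4.1 (Generalization of Theorem 4 in [Rätsch et al., 2002]). Let $\mathcal{R}$ be a $m$-strongly
convex and $M$-strongly smooth functional over $\mathcal{F}$. Let $\mathcal{H} \subset \mathcal{F}$ be a restriction set with edge $\gamma$
for every gradient $\nabla_t$ that is projected on to $\mathcal{H}$. Let $f^* = \arg\min_{f \in \mathcal{F}} \mathcal{R}[f]$. Given a starting
point $f_0$ and step size $\eta_t = \frac{1}{M}$, after $T$ iterations of Algorithm 2.2 we have:

$$\mathcal{R}[f_T] - \mathcal{R}[f^*] \leq \left(1 - \frac{\gamma^2 m}{M}\right)^T (\mathcal{R}[f_0] - \mathcal{R}[f^*]).$$

Proof. Starting with the definition of strong smoothness, and examining the objective value at
time $t + 1$ we have:

$$\mathcal{R}[f_{t+1}] \leq \mathcal{R}[f_t] + \langle\nabla\mathcal{R}[f_t], f_{t+1} - f_t\rangle + \frac{M}{2}\|f_{t+1} - f_t\|^2$$

Then, using $f_{t+1} = f_t - \frac{1}{M}\frac{\langle\nabla\mathcal{R}[f_t], h_t\rangle}{\|h_t\|^2} h_t$ we get:

$$\mathcal{R}[f_{t+1}] \leq \mathcal{R}[f_t] - \frac{1}{2M}\frac{\langle\nabla\mathcal{R}[f_t], h_t\rangle^2}{\|h_t\|^2}$$

Subtracting the optimal value from both sides and applying the edge requirement we get:

$$\mathcal{R}[f_{t+1}] - \mathcal{R}[f^*] \leq \mathcal{R}[f_t] - \mathcal{R}[f^*] - \frac{\gamma^2}{2M}\|\nabla\mathcal{R}[f_t]\|^2$$

From the definition of strong convexity we know $\|\nabla\mathcal{R}[f_t]\|^2 \geq 2m(\mathcal{R}[f_t] - \mathcal{R}[f^*])$ where
$f^*$ is the minimum point. Rearranging we can conclude that:

$$\mathcal{R}[f_{t+1}] - \mathcal{R}[f^*] \leq (\mathcal{R}[f_t] - \mathcal{R}[f^*])\left(1 - \frac{\gamma^2 m}{M}\right)$$

Recursively applying the above bound starting at $t = 0$ gives the final bound on $\mathcal{R}[f_T] - \mathcal{R}[f^*]$. ■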

The result above holds for the fixed step size $\frac{1}{M}$ as well as for step sizes found using a line
search along the descent direction. A line search can only improve the decrease obtained at each
iteration, and the proof above bounds the progress of each iteration independently.

Theorem 2.4.1 gives, for strongly smooth objective functionals, a convergence rate of
$O((1 - \frac{\gamma^2 m}{M})^T)$. This is very similar to the $O((1 - 4\gamma^2)^{T/2})$ convergence of AdaBoost [Freund and Schapire,
1997], or $O((1 - \frac{1}{C})^T)$ convergence rate given by Rätsch et al. [2002], as all require $O(\log(\frac{1}{\epsilon}))$
iterations to get performance within $\epsilon$ of the optimal result.
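
To make the iteration count concrete, the following worked bound is a sketch that assumes the rate in Theorem 2.4.1 reads $(1 - \frac{\gamma^2 m}{M})^T$ and writes $\epsilon_0 = \mathcal{R}[f_0] - \mathcal{R}[f^*]$ for the initial gap (a symbol introduced here for illustration). Using $1 - x \leq e^{-x}$:

$$\left(1 - \frac{\gamma^2 m}{M}\right)^T \epsilon_0 \leq \exp\left(-\frac{\gamma^2 m}{M}\,T\right)\epsilon_0 \leq \epsilon \quad \text{whenever} \quad T \geq \frac{M}{\gamma^2 m}\ln\frac{\epsilon_0}{\epsilon}.$$

For instance, with $\gamma = 0.1$ and $\frac{M}{m} = 10$, shrinking the gap by a factor of $10^3$ takes at most $1000\ln(10^3) \approx 6908$ iterations under this bound.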

While  the  AdaBoost  result  generally  provides  tighter  bounds,  this  relatively  naive  method  of 
gradient projection is able to obtain reasonably competitive convergence results while being
applicable to a much wider range of problems. This is expected, as the proposed method derives
no  benefit  from  loss-specific  optimizations  and  can  use  a  much  broader  class  of  weak  learners. 
This  comparison  is  a  common  scenario  within  optimization:  while  highly  specialized  algorithms 
can  often  perform  better  on  specific  problems,  general  solutions  often  obtain  equally  impressive 
results,  albeit  less  efficiently,  while  requiring  much  less  effort  to  implement. 

2.4.1  Non-smooth  Degeneration 

Unfortunately, the naive approach to restricted gradient descent breaks down quickly in more
general cases such as non-smooth objectives. Consider the following example objective over two
points $x_1, x_2$ (also depicted in Figure 2.2): $\mathcal{R}[f] = 2|f(x_1)| + |f(x_2)|$. For this problem, a valid
subgradient is $\nabla$ such that $\nabla(x_1) = 2\,\mathrm{sgn}(f(x_1))$ and $\nabla(x_2) = \mathrm{sgn}(f(x_2))$. We will assume that
$\mathrm{sgn}(0) = 1$, to give us a unique subgradient for the $f(x_1) = 0$ or $f(x_2) = 0$ case.

Now consider the hypothesis set $h \in \mathcal{H}$ such that either $h(x_1) \in \{-1, +1\}$ and $h(x_2) = 0$, or
$h(x_1) = 0$ and $h(x_2) \in \{-1, +1\}$. The algorithm will always select $h^*$ such that $h^*(x_2) = 0$ when
projecting gradients from the example objective, and the set $\mathcal{H}$ will always have reasonably large
edge with respect to $\nabla$.

This procedure, however, gives a final function with perfect performance on $x_1$ and arbitrarily
poor, unchanged performance on $x_2$, depending on the choice of starting function $f_0$. Even if the loss
on training point $x_2$ is substantial due to a bad starting location, naively applying the basic gradient
projection algorithm will not correct it.

Figure 2.2: A demonstration of a non-smooth objective for which the basic restricted gradient descent
algorithm fails. The possible weak predictors and optimal value $f^*$ are depicted in (a), while (b) gives the
result of running the basic restricted gradient algorithm on this problem for an example starting point $f_0$ and
demonstrates the optimality gap. This gap is due to the fact that the algorithm will only ever select $h_1$ or
$-h_1$ as possible descent directions, as depicted in (c) and (d).

An algorithm which greedily projects subgradients of $\mathcal{R}$, such as Algorithm 2.2, will not be
able to obtain strong performance results for cases like these. The algorithms in the next section
overcome this obstacle by projecting modified versions of the subgradients of the objective at each
iteration.
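
The failure on the two-point example can be reproduced numerically. The sketch below is illustrative only: it stores $f$ as the pair $[f(x_1), f(x_2)]$, uses a plain dot product as the inner product, and assumes a step size schedule of $1/t$; all function names are hypothetical.

```python
import math


def dot(f, g):
    return sum(a * b for a, b in zip(f, g))


def sgn(v):
    """Sign with the convention sgn(0) = 1 used in the text."""
    return 1.0 if v >= 0 else -1.0


def objective(f):
    """R[f] = 2|f(x1)| + |f(x2)|, with f stored as [f(x1), f(x2)]."""
    return 2.0 * abs(f[0]) + abs(f[1])


def subgradient(f):
    return [2.0 * sgn(f[0]), sgn(f[1])]


# Hypotheses that are nonzero on exactly one of the two points.
HYPOTHESES = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]


def basic_projection_run(f0, num_steps):
    f = list(f0)
    for t in range(1, num_steps + 1):
        grad = subgradient(f)
        h = max(HYPOTHESES, key=lambda c: dot(grad, c) / math.sqrt(dot(c, c)))
        scale = (1.0 / t) * dot(grad, h) / dot(h, h)  # assumed step size 1/t
        f = [fi - scale * hi for fi, hi in zip(f, h)]
    return f


f_start = [3.0, 5.0]
f_end = basic_projection_run(f_start, 200)
# f(x1) is driven toward 0, but f(x2) never moves from its starting value.
print(f_end, objective(f_end))
```

Because the $x_1$ direction always has the larger normalized inner product with the subgradient, the value at $x_2$ stays at its starting point no matter how many iterations are run.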


2.5  General  Convex  Convergence  Results 

For the convergence analysis of general convex functions we now switch to analyzing the average
optimality gap:

$$\frac{1}{T}\sum_{t=1}^T \left[\mathcal{R}[f_t] - \mathcal{R}[f^*]\right],$$

where $f^* = \arg\min_{f \in \mathcal{F}} \sum_{t=1}^T \mathcal{R}[f]$ is the fixed hypothesis which minimizes loss.

By showing that the average optimality gap approaches 0 as $T$ grows large, for decreasing step
sizes, it can be shown that the optimality gap $\mathcal{R}[f_t] - \mathcal{R}[f^*]$ also approaches 0.

This analysis is similar to the standard no-regret online learning approach, but we restrict our
analysis to the case when $\mathcal{R}_t = \mathcal{R}$. This is because the true online setting typically involves receiving
a new dataset at every time $t$, and hence a different data distribution $\hat{P}_t$, effectively changing the
underlying $L^2$ function space of operations such as gradient projection at every time step, making
comparison of quantities at different time steps difficult in the analysis. The convergence analysis
for the online case is beyond the scope of our work and is not presented here.

The  convergence  results  to  follow  are  similar  to  previous  convergence  results  for  the  standard 
gradient  descent  setting  [Zinkevich,  2003,  Hazan  et  al.,  2006],  but  with  a  number  of  additional  error 
terms  due  to  the  gradient  projection  step.  Sutskever  [2009]  has  previously  studied  the  convergence 
of  gradient  descent  with  gradient  projection  errors  using  an  algorithm  similar  to  Algorithm  2.2,  but 
the analysis does not focus on the weak to strong learning guarantee we seek.¹ In order to obtain
this guarantee we now present two new algorithms.

Our first general convex solution, shown in Algorithm 2.3, overcomes this issue by using a
meta-boosting strategy. At each iteration $t$, instead of projecting the gradient $\nabla_t$ onto a single
hypothesis $h^*$, we use the naive algorithm to construct $h^*$ out of a small number of restricted steps,
optimizing over the distance $\|\nabla_t - h^*\|^2$. By increasing the number of weak learners trained at
each iteration over time, we effectively decrease the gradient projection error at each iteration. As
the average projection error approaches 0, the performance of the combined hypothesis approaches
optimal. We can now prove convergence results for this algorithm for both strongly convex and
convex functionals.
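
The repeated projection strategy can be sketched as follows on the toy objective from Section 2.4.1. This is a hedged illustration rather than the thesis code: it assumes the projection returns the chosen direction scaled by $\langle\nabla', h\rangle / \|h\|^2$, uses a plain dot product, and assumes step sizes of $1/t$; every name below is hypothetical.

```python
import math


def dot(f, g):
    return sum(a * b for a, b in zip(f, g))


def sgn(v):
    return 1.0 if v >= 0 else -1.0  # convention sgn(0) = 1


def subgradient(f):
    """Subgradient of the toy objective R[f] = 2|f(x1)| + |f(x2)|."""
    return [2.0 * sgn(f[0]), sgn(f[1])]


HYPOTHESES = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]


def scaled_projection(vec, hypotheses):
    """Best direction h, scaled by <vec, h> / ||h||^2 (assumed meaning of Proj)."""
    h = max(hypotheses, key=lambda c: dot(vec, c) / math.sqrt(dot(c, c)))
    coef = dot(vec, h) / dot(h, h)
    return [coef * hi for hi in h]


def repeated_gradient_projection(f0, subgradient_fn, hypotheses, step_sizes):
    f = list(f0)
    for t, eta in enumerate(step_sizes, start=1):
        residual = subgradient_fn(f)
        h_star = [0.0] * len(f)
        for _ in range(t):  # t weak learners are fit at iteration t
            h_k = scaled_projection(residual, hypotheses)
            h_star = [a + b for a, b in zip(h_star, h_k)]
            residual = [a - b for a, b in zip(residual, h_k)]
        f = [fi - eta * hi for fi, hi in zip(f, h_star)]
    return f


steps = [1.0 / t for t in range(1, 101)]
f_end = repeated_gradient_projection([3.0, 2.0], subgradient, HYPOTHESES, steps)
# Both coordinates now approach the minimizer at 0.
print(f_end)
```

From the second iteration onward the inner loop recovers the full subgradient on this problem, so the update behaves like ordinary subgradient descent and the value at $x_2$ is corrected as well.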

Algorithm 2.3 Repeated Gradient Projection Algorithm

Given: starting point $f_0$, objective $\mathcal{R}$, step size schedule $\{\eta_t\}_{t=1}^T$
for $t = 1, \ldots, T$ do
    Compute subgradient $\nabla_t \in \partial\mathcal{R}[f_{t-1}]$.
    Let $\nabla' = \nabla_t$, $h^* = 0$.
    for $k = 1, \ldots, t$ do
        Compute $h_k^* = \mathrm{Proj}(\nabla', \mathcal{H})$.
        $h^* \leftarrow h^* + h_k^*$.
        $\nabla' \leftarrow \nabla' - h_k^*$.
    end for
    Update $f$: $f_t = f_{t-1} - \eta_t h^*$.
end for

Theorem 2.5.1. Let $\mathcal{R}$ be a $m$-strongly convex functional over $\mathcal{F}$. Let $\mathcal{H} \subset \mathcal{F}$ be a restriction
set with edge $\gamma$ for every $\nabla'$ that is projected on to $\mathcal{H}$. Let $\|\nabla\mathcal{R}[f]\| \leq G$. Let
$f^* = \arg\min_{f \in \mathcal{F}} \mathcal{R}[f]$. Given a starting point $f_0$ and step size $\eta_t = \frac{5}{4mt}$, after $T$ iterations of
Algorithm 2.3 we have:

$$\frac{1}{T}\sum_{t=1}^T \left[\mathcal{R}[f_t] - \mathcal{R}[f^*]\right] \leq \frac{5G^2}{2mT}\left(1 + \ln T + \frac{1-\gamma^2}{\gamma^2}\right).$$

¹ In fact, Sutskever’s convergence results can be used to show that the bound on training error for the basic gradient
projection algorithm asymptotically approaches the average error of the weak learners, only indicating that you are
guaranteed to find a hypothesis which does no worse than any individual weak learner, despite its increased complexity.

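Proof. First, we start by bounding the potential $\|f_t - f^*\|^2$, similar to the potential function
arguments of Zinkevich [2003] and Hazan et al. [2006], but with a different descent step:

$$\|f_{t+1} - f^*\|^2 \leq \|f_t - \eta_{t+1} h_t - f^*\|^2$$

$$= \|f_t - f^*\|^2 + \eta_{t+1}^2\|h_t\|^2 - 2\eta_{t+1}\langle f_t - f^*, h_t - \nabla_t\rangle - 2\eta_{t+1}\langle f_t - f^*, \nabla_t\rangle$$

$$\langle f^* - f_t, \nabla_t\rangle \geq \frac{1}{2\eta_{t+1}}\|f_{t+1} - f^*\|^2 - \frac{1}{2\eta_{t+1}}\|f_t - f^*\|^2 - \frac{\eta_{t+1}}{2}\|h_t\|^2 - \langle f^* - f_t, h_t - \nabla_t\rangle$$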
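Using the definition of strong convexity and summing:

$$\sum_{t=1}^T \mathcal{R}[f^*] \geq \sum_{t=1}^T \mathcal{R}[f_t] + \sum_{t=1}^T \langle f^* - f_t, \nabla_t\rangle + \sum_{t=1}^T \frac{m}{2}\|f^* - f_t\|^2$$

$$\geq \sum_{t=1}^T \mathcal{R}[f_t] + \sum_{t=1}^T \frac{1}{2}\|f_t - f^*\|^2\left(\frac{1}{\eta_t} - \frac{1}{\eta_{t+1}} + m\right) - \sum_{t=1}^T \frac{\eta_{t+1}}{2}\|h_t\|^2 - \sum_{t=1}^T \langle f^* - f_t, h_t - \nabla_t\rangle$$

Setting $\eta_t = \frac{5}{4mt}$ and using the bound $\|h_t\| \leq 2\|\nabla_t\| \leq 2G$:

$$\sum_{t=1}^T \mathcal{R}[f^*] \geq \sum_{t=1}^T \mathcal{R}[f_t] + \sum_{t=1}^T \frac{m}{10}\|f_t - f^*\|^2 - \sum_{t=1}^T \frac{\eta_{t+1}}{2}\|h_t\|^2 - \sum_{t=1}^T \langle f^* - f_t, h_t - \nabla_t\rangle$$

$$\geq \sum_{t=1}^T \mathcal{R}[f_t] - \frac{5G^2}{2m}\sum_{t=1}^T \frac{1}{t} - \sum_{t=1}^T \left[\langle f^* - f_t, h_t - \nabla_t\rangle - \frac{m}{10}\|f^* - f_t\|^2\right]$$

$$\geq \sum_{t=1}^T \mathcal{R}[f_t] - \frac{5G^2}{2m}(1 + \ln T) - \sum_{t=1}^T \left[\langle f^* - f_t, h_t - \nabla_t\rangle - \frac{m}{10}\|f^* - f_t\|^2\right]$$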


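Next we bound the remaining term by using a variant of the polarization identity. First we
expand

$$\frac{1}{2}\left\|\sqrt{c}\,A - \frac{1}{\sqrt{c}}\,B\right\|^2 = \frac{c}{2}\|A\|^2 + \frac{1}{2c}\|B\|^2 - \langle A, B\rangle.$$

Then, using the fact that $\frac{1}{2}\left\|\sqrt{c}\,A - \frac{1}{\sqrt{c}}\,B\right\|^2 \geq 0$, we can bound:

$$\frac{1}{2c}\|B\|^2 \geq \langle A, B\rangle - \frac{c}{2}\|A\|^2.$$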
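Using that bound and the result from Theorem 2.4.1 we can bound the error at each step $t$:

$$\sum_{t=1}^T \mathcal{R}[f^*] \geq \sum_{t=1}^T \mathcal{R}[f_t] - \frac{5G^2}{2m}(1 + \ln T) - \frac{5}{2m}\sum_{t=1}^T \|h_t - \nabla_t\|^2$$

$$\geq \sum_{t=1}^T \mathcal{R}[f_t] - \frac{5G^2}{2m}(1 + \ln T) - \frac{5G^2}{2m}\sum_{t=1}^T (1 - \gamma^2)^t$$

$$\geq \sum_{t=1}^T \mathcal{R}[f_t] - \frac{5G^2}{2m}(1 + \ln T) - \frac{5G^2}{2m}\,\frac{1-\gamma^2}{\gamma^2}$$

giving the final bound.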

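This bound can be improved slightly by instead using the step size $\eta_t = \frac{\lambda}{mt}$, in which case
the final bound will be

$$\sum_{t=1}^T \mathcal{R}[f^*] \geq \sum_{t=1}^T \mathcal{R}[f_t] - \frac{2G^2\lambda}{m}(1 + \ln T) - \frac{G^2}{2m}\,\frac{\lambda}{\lambda - 1}\,\frac{1-\gamma^2}{\gamma^2}.$$

Minimizing this over $\lambda$ gives the optimal step size value of

$$\lambda = \sqrt{\frac{1-\gamma^2}{4\gamma^2(1 + \ln T)}} + 1.$$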


The  proof  relies  on  the  fact  that  as  the  number  of  iterations  increases,  our  gradient  projection 
error  approaches  0  at  the  rate  given  in  Theorem  2.4.1,  causing  the  behavior  of  Algorithm  2.3  to 
approach  the  standard  gradient  descent  algorithm.  The  additional  error  term  in  the  result  is  a  bound 
on  the  geometric  series  describing  the  errors  introduced  at  each  time  step. 


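Theorem 2.5.2. Let $\mathcal{R}$ be a convex functional over $\mathcal{F}$. Let $\mathcal{H} \subset \mathcal{F}$ be a restriction set with edge
$\gamma$ for every $\nabla'$ that is projected on to $\mathcal{H}$. Let $\|\nabla\mathcal{R}[f]\| \leq G$ and $\|f\| \leq F$ for all $f \in \mathcal{F}$. Let
$f^* = \arg\min_{f \in \mathcal{F}} \mathcal{R}[f]$. Given a starting point $f_0$ and step size $\eta_t = \frac{1}{\sqrt{t}}$, after $T$ iterations of
Algorithm 2.3 we have: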


2  G2 

VT 


+  2  FG 


1-72 
7 2T  ‘ 


Proof. Like the last proof, we start with the altered potential and sum over the definition of
convexity:


t= i 


t= i 


t= i 


Vt  Vt+ 1 


+  m) 


E^ini  2  -  E  </*-/■•  fc>  -  v«) 


Setting $\eta_t = \frac{1}{\sqrt{t}}$ and using the bound $\|h_t\| \leq 2\|\nabla_t\| \leq 2G$ and the result from Theorem 2.4.1
we can bound the error at each step $t$:


!>[/*]  >I>[ 


T  i  T 


t=l 


t=l 

T 

>Eri 

t=i 
>I>[

t=  i 


-|l/r  -  /* II2  -  2G2  53  -=  -  XI  </*  -  />.  '*<  -  v<) 
2U
1-72

72 


L 


giving  the  final  bound. 


Again, the result is similar to the standard gradient descent result, with an added error term
dependent on the edge $\gamma$.

In practice, the large number of weak learners trained using this method could become
prohibitive, although the observed performance of this algorithm is often much better than that given by
the derived bounds above.

In  this  case,  an  alternative  version  of  the  repeated  projection  algorithm  allows  for  a  variable 
number  of  weak  learners  to  be  trained  at  each  iteration.  An  accuracy  threshold  for  each  gradient 
projection  could  be  derived  given  a  desired  accuracy  for  the  final  hypothesis,  and  this  threshold  can 
be  used  to  train  weak  learners  at  each  iteration  until  the  desired  accuracy  is  reached.  In  practice, 
this  would  allow  for  only  the  number  of  weak  learners  required  to  reach  a  given  accuracy  target  to 
be  trained,  reducing  the  total  number  of  weak  learners. 

Algorithm  2.4  gives  another  approach  for  optimizing  over  convex  objectives  which  may  also 
address  this  issue  of  the  increasingly  large  number  of  weak  learners.  Like  the  previous  approach, 
the  projection  error  at  each  time  step  is  used  again  in  projection,  but  a  new  step  is  not  taken 
immediately  to  decrease  the  projection  error.  Instead,  this  approach  keeps  track  of  the  residual 
error  left  over  after  projection  and  includes  this  error  in  the  next  projection  step.  This  forces  the 
projection  steps  to  eventually  account  for  past  errors,  preventing  the  possibility  of  systematic  error 
being  adversarially  introduced  through  the  weak  learner  set. 
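
The residual idea can be sketched on the same toy objective used in Section 2.4.1. This is an illustrative sketch under stated assumptions: the projected step and the residual update both use the coefficient $\langle\Delta, h\rangle / \|h\|^2$, the inner product is a plain dot product, and the step sizes are $1/t$; the function and variable names are hypothetical.

```python
import math


def dot(f, g):
    return sum(a * b for a, b in zip(f, g))


def sgn(v):
    return 1.0 if v >= 0 else -1.0  # convention sgn(0) = 1


def objective(f):
    return 2.0 * abs(f[0]) + abs(f[1])


def subgradient(f):
    return [2.0 * sgn(f[0]), sgn(f[1])]


HYPOTHESES = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]


def residual_gradient_projection(f0, subgradient_fn, hypotheses, step_sizes):
    f = list(f0)
    residual = [0.0] * len(f)  # accumulated projection error (Delta)
    for eta in step_sizes:
        grad = subgradient_fn(f)
        residual = [r + g for r, g in zip(residual, grad)]
        h = max(hypotheses, key=lambda c: dot(residual, c) / math.sqrt(dot(c, c)))
        coef = dot(residual, h) / dot(h, h)
        f = [fi - eta * coef * hi for fi, hi in zip(f, h)]
        # Whatever the projection missed is carried into the next iteration.
        residual = [r - coef * hi for r, hi in zip(residual, h)]
    return f


f_start = [3.0, 2.0]
f_end = residual_gradient_projection(
    f_start, subgradient, HYPOTHESES, [1.0 / t for t in range(1, 5)])
print(f_end, objective(f_start), objective(f_end))
```

Only one weak learner is fit per iteration, yet the neglected $x_2$ component accumulates in the residual until it wins the projection, so both coordinates are eventually updated.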

As with Algorithm 2.3, we can derive similar convergence results for strongly-convex and
general convex functionals for this new residual-based algorithm.

Algorithm 2.4 Residual Gradient Projection Algorithm

Given: starting point $f_0$, objective $\mathcal{R}$, step size schedule $\{\eta_t\}_{t=1}^T$
Let $\Delta = 0$.
for $t = 1, \ldots, T$ do
    Compute subgradient $\nabla_t \in \partial\mathcal{R}[f_{t-1}]$.
    $\Delta = \Delta + \nabla_t$.
    Compute $h^* = \mathrm{Proj}(\Delta, \mathcal{H})$.
    Update $f$: $f_t = f_{t-1} - \eta_t \frac{\langle h^*, \Delta\rangle}{\|h^*\|^2} h^*$.
    Update residual: $\Delta = \Delta - \frac{\langle h^*, \Delta\rangle}{\|h^*\|^2} h^*$.
end for

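Theorem 2.5.3. Let $\mathcal{R}$ be a $m$-strongly convex functional over $\mathcal{F}$. Let $\mathcal{H} \subset \mathcal{F}$ be a restriction
set with edge $\gamma$ for every $\Delta$ that is projected on to $\mathcal{H}$. Let $\|\nabla\mathcal{R}[f]\| \leq G$. Let
$f^* = \arg\min_{f \in \mathcal{F}} \mathcal{R}[f]$. Let $c = \frac{2}{\gamma^2}$. Given a starting point $f_0$ and step size $\eta_t = \frac{1}{mt}$, after
$T$ iterations of Algorithm 2.4 we have:

$$\frac{1}{T}\sum_{t=1}^T \left[\mathcal{R}[f_t] - \mathcal{R}[f^*]\right] \leq \frac{2c^2G^2}{mT}\left(1 + \ln T + \frac{2}{T}\right).$$

□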


Proof.  Like  the  proof  of  Theorem  2.5.1,  we  again  use  a  potential  function  and  sum  over  the 
definition  of  convexity: 


t= 1 


T  T  T 

£klt l  >£-«[/<]  +  £  h\f,  -  r\\2(f  -  ft-  + «.)- 

t= 1  t=i  't+1 

T  T  T—l 

£  ^llh.112  -  £  (/’  -  /.,  h,  -  (A,  +  V,)>  -£(/*-  /t+i,  A,+i) 
T  T  T—l  T—l

£  ^INI2  -  £  (/•  -  ft.  h,  -  (A,  +  V,)>  -£{/*-/.,  At+1)  -  £  (,,<+ a,  At+1> 


t=l 


1=1 


1=0 


1=0 


where $h_t$ is the augmented step taken in Algorithm 2.4.

Setting $\eta_t = \frac{1}{mt}$ and using the bound $\|h_t\| \leq \|\nabla_t\| \leq G$, along with $\Delta_{t+1} = (\Delta_t + \nabla_t) - h_t$:


A 

T  T 


mT 


£>[. /*]  >  E^mE^ini2  -  (</*  -  /t+i.  a,+i)  -  ^nr  -  /t+i||2)  -  (mh„  a1+1> 


t=i 


t=i 


t=i 


t=i 


We can bound the norm of $\Delta_t$ by considering that (a) it starts at 0 and (b) at each time step
it increases by at most $\|\nabla_t\|$ and is multiplied by $\sqrt{1 - \gamma^2}$. This implies that $\|\Delta_t\| \leq cG$ where

$$c = \frac{\sqrt{1-\gamma^2}}{1 - \sqrt{1-\gamma^2}} < \frac{2}{\gamma^2}.$$

From here we can get a final bound:

T  T  „2(~i2  Or2r2  n2r2 

2— — —a + inr) 

t= i  t= i 


mT  m 


L 


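Theorem 2.5.4. Let $\mathcal{R}$ be a convex functional over $\mathcal{F}$. Let $\mathcal{H} \subset \mathcal{F}$ be a restriction set with edge
$\gamma$ for every $\Delta$ that is projected on to $\mathcal{H}$. Let $\|\nabla\mathcal{R}[f]\| \leq G$ and $\|f\| \leq F$ for all $f \in \mathcal{F}$. Let
$f^* = \arg\min_{f \in \mathcal{F}} \mathcal{R}[f]$. Let $c = \frac{2}{\gamma^2}$. Given a starting point $f_0$ and step size $\eta_t = \frac{1}{\sqrt{t}}$, after $T$
iterations of Algorithm 2.4 we have:

$$\frac{1}{T}\sum_{t=1}^T \left[\mathcal{R}[f_t] - \mathcal{R}[f^*]\right] \leq \frac{F^2}{2\sqrt{T}} + \frac{c^2G^2}{\sqrt{T}} + \frac{c^2G^2}{2T^{3/2}}.$$

□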

Proof. Similar to the last few proofs, we get a result similar to the standard gradient version,
with the error term from the last proof:


I>[/i  >!>[/*] +  £di/t-r 


t= 1 


*=1 

T 


i=l 


Vt  Vt+ 1 ' 


E  ^ll^ll2  -  ((/*  -  /t+1,  At+1)  -  A || /•  _  /T+1f)  _  £  A1+1) 
t=i


Using the bound $\|\Delta_t\| \leq cG$ from above and setting $\eta_t = \frac{1}{\sqrt{t}}$:


T 


t=i 


t= 1 


2 



Lc 


giving  the  final  bound. 


Again,  the  results  are  similar  bounds  to  those  from  the  non-restricted  case.  Like  the  previous 
proof,  the  extra  terms  in  the  bound  come  from  the  penalty  paid  in  projection  errors  at  each  time 
step,  but  here  the  residual  serves  as  a  mechanism  for  pushing  the  error  back  to  later  projections. 
The analysis relies on a bound on the norm of the residual $\Delta$, derived by observing that it is
increased  by  at  most  the  norm  of  the  gradient  and  then  multiplicatively  decreased  in  projection  due 
to  the  edge  requirement.  This  bound  on  the  size  of  the  residual  presents  itself  in  the  c  term  present 
in  the  bound. 

In terms of efficiency, these two algorithms are similarly matched. For the strongly convex
case, the repeated projection algorithm uses $O(T^2)$ weak learners to obtain an average regret of
$O(\frac{\ln T}{T} + \frac{1}{\gamma^2 T})$, while the residual algorithm uses $O(T)$ weak learners and has average regret $O(\frac{\ln T}{\gamma^4 T})$.
The major difference lies in frequency of the gradient evaluation, where the repeated projection
algorithm evaluates the gradient much less often than the residual algorithm.
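
The weak learner counts follow directly from the loop structure of the two algorithms. As a worked sum (added here for illustration), the repeated projection algorithm fits $t$ weak learners at iteration $t$, so

$$\sum_{t=1}^T t = \frac{T(T+1)}{2} = O(T^2),$$

whereas the residual algorithm fits one weak learner per iteration for a total of $T$. For example, $T = 100$ outer iterations use 5050 weak learners under repeated projection versus 100 under residual projection, while both evaluate the gradient 100 times.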


Figure  2.3:  Test  set  loss  vs  number  of  weak  learners  used  for  a  maximum  margin  structured  imitation 
learning  problem  for  all  three  restricted  gradient  algorithms.  The  algorithms  shown  are  the  naive  use  of  the 
basic  projection  (black  dashed  line),  repeated  projection  steps  (red  solid  line),  and  the  residual  projection 
algorithm  (blue  long  dashed  line). 


2.6  Experiments 

We  now  present  experimental  results  for  these  new  algorithms  on  three  tasks:  an  imitation  learning 
problem,  a  ranking  problem  and  a  set  of  sample  classification  tasks. 
Figure 2.4: Test set disagreement (fraction of violated constraints) vs number of weak learners used for
the MSLR-WEB10K ranking dataset for all three restricted gradient algorithms. The algorithms shown are
the naive use of the basic projection (black dashed line), repeated projection steps (red solid line), and the
residual projection algorithm (blue long dashed line).


The first experimental setup is an optimization problem which results from the Maximum Margin
Planning [Ratliff et al., 2009] approach to imitation learning. In this setting, a demonstrated
policy is provided as example behavior and the goal is to learn a cost function over features of the
environment which produces policies with similar behavior.

Previous attempts in the literature have been made to adapt boosting to this setting [Ratliff
et al., 2009, Bradley, 2009], similar to the naive algorithm presented here, but no convergence
results for this setting are known.

Figure  2.3  shows  the  results  of  naively  applying  the  basic  projected  gradient  algorithm,  as  well 
as  running  the  two  new  algorithms  presented  here  on  a  sample  planning  dataset  from  this  domain. 
The  weak  learners  used  were  neural  networks  with  5  hidden  units  each. 

The  second  experimental  setting  is  a  ranking  task  from  the  Microsoft  Learning  to  Rank  Datasets, 
specifically  MSLR-WEB10K  [Microsoft,  2010],  using  the  ranking  version  of  the  hinge  loss  and 
decision  stumps  as  weak  learners.  Figure  2.4  shows  the  test  set  disagreement  (the  percentage  of 
violated  ranking  constraints)  plotted  against  the  number  of  weak  learners. 

As  a  final  test,  we  ran  our  boosting  algorithms  on  several  multiclass  classification  tasks  from 
the  UCI  Machine  Learning  Repository  [Frank  and  Asuncion,  2010],  using  the  ‘connect4’,  ‘letter’, 
‘pendigits’  and  ‘satimage’  datasets.  All  experiments  used  the  multiclass  extension  to  the  hinge  loss 
[Crammer  and  Singer,  2002],  along  with  multiclass  decision  stumps  for  the  weak  learners.  Results 
are  given  in  Figure  2.5. 

Of particular interest are the experiments where the naive approach to restricted gradient descent
clearly fails to converge (‘connect4’ and ‘letter’). In line with the presented convergence
results, both non-smooth algorithms approach optimal training performance at relatively similar
rates, while the naive approach cannot overcome the particular conditions of these datasets and
fails to achieve strong performance. In these cases, the naive approach repeatedly cycles through
the same weak learners, impeding further optimization progress.

Figure 2.5: Test set classification error on multiclass classification experiments over the UCI ‘connect4’,
‘letter’, ‘pendigits’ and ‘satimage’ datasets. The algorithms shown are the naive use of the basic projection
(black dashed line), repeated projection steps (red solid line), and the residual projection algorithm (blue
long dashed line).
Chapter 3

Functional Gradient Extensions


In this chapter we detail two extensions to the functional gradient techniques described in
Chapter 2. The first extension covers the structured prediction setting, where each example has a
corresponding structured output we wish to generate, consisting of a number of individual predictions
over the relevant structural features of the problem. For example, the structured output might be
a label for every word in a sentence, or every pixel in an image. Most notably different from the
previous chapter, in this domain we will learn boosted learners that rely on the values of previous
predictions at each iteration of boosting, so the learned function can account for relationships
between structurally related predictions.

This change introduces a unique type of overfitting which often results in a cascade of failures
in practice, due to the reliance on potentially overfit previous predictions at training time. To
address this, we introduce a second extension which is a stacked version of functional gradient
methods. This algorithm improves robustness to the overfitting that occurs when predictions from
the early weak learners are reused as input features to later weak learners.


3.1  Structured  Boosting 

3.1.1  Background 

In the structured prediction setting, we are given inputs $x \in \mathcal{X}$ and associated structured outputs
$y \in \mathcal{Y}$. The goal is to learn a function $f : \mathcal{X} \to \mathcal{Y}$ that minimizes some risk $\mathcal{R}[f]$, typically
evaluated pointwise over the inputs:

$$\mathcal{R}[f] = \mathbb{E}_{\mathcal{X}}[\ell(f(x))], \quad (3.1)$$

similar to the pointwise loss discussed in the previous chapter in Equation (2.4).

We will further assume that each input and output pair has some underlying structure, such
as the graph structure of graphical models, that can be utilized to predict portions of the output
locally. Let $j$ index these structural elements. We then assume that a final structured output $y$ can
be represented as a variable length vector $(y_1, \ldots, y_J)$, where each element $y_j$ lies in some vector
space $y_j \in \mathcal{Y}'$. For example, these outputs could be the probability distribution over class labels
for each pixel in an image, or distributions of part-of-speech labels for each word in a sentence.
Similarly we can compute some features $x_j$ representing the portion of the input which corresponds
to a given output, such as features computed over a neighborhood around a pixel in an input image.

As  another  example,  consider  labeling  tasks  such  as  part-of-speech  tagging.  In  this  domain,  we 
are given a set of input sentences $\mathcal{X}$, and for each word $j$ in a given sentence, we want to output a
vector $\hat{y}_j \in \mathbb{R}^K$ containing the scores with respect to each of the $K$ possible part-of-speech labels
for that word. This sequence of vectors for each word is the complete structured prediction $\hat{y}$. An
example  loss  function  for  this  domain  would  be  the  multiclass  log-loss,  averaged  over  words,  with 
respect  to  the  ground  truth  parts-of-speech. 

Along  with  the  encoding  of  the  problem,  we  also  assume  that  the  structure  can  be  used  to 
reduce  the  scope  of  the  prediction  problem,  as  in  graphical  models.  One  common  approach  to 
generating  predictions  on  these  structures  is  to  use  a  policy-based  or  iterative  decoding  approach 
[Cohen  and  Carvalho,  2005,  Daume  III  et  al.,  2009,  Tu  and  Bai,  2010,  Socher  et  al.,  2011,  Ross 
et al., 2011], instead of probabilistic inference over a graphical model. In order to model the contextual
relationships among the outputs, these iterative approaches commonly perform a sequence
of  predictions,  where  each  update  relies  on  previous  predictions  made  across  the  structure  of  the 
problem. 

Let $N(j)$ represent the locally connected elements of $j$, such as the locally connected factors
of a node $j$ in a typical graphical model. For a given node $j$, the predictions over the neighboring
nodes $\hat{y}_{N(j)}$ can then be used to update the prediction for that node. For example, in the character
recognition  task,  the  predictions  for  neighboring  characters  can  influence  the  prediction  for  a  given 
character,  and  be  used  to  update  and  improve  the  accuracy  of  that  prediction. 

In the iterative decoding approach a predictor $\phi$ is iteratively used to update different elements
$\hat{y}_j$ of the final structured output:

$$\hat{y}_j = \phi(x_j, \hat{y}_{N(j)}), \tag{3.2}$$

using both the features $x_j$ of that element and current predictions $\hat{y}_{N(j)}$ of the neighboring elements.
In  the  message  passing  analogy,  these  current  predictions  are  the  messages  that  are  passed  between 
nodes,  and  used  for  updating  the  current  prediction  at  that  node. 

A  complete  policy  then  consists  of  a  strategy  for  selecting  which  elements  of  the  structured 
output  to  update,  coupled  with  the  predictor  for  updating  the  given  outputs.  Typical  approaches 
include  randomly  selecting  elements  to  update,  iterating  over  the  structure  in  a  fixed  ordering,  or 
simultaneously updating all predictions at all iterations. As shown by Ross et al. [2011], this iterative
decoding approach is equivalent to message passing approaches used to perform inference
over  graphical  models,  where  each  update  encodes  a  single  set  of  messages  passed  to  one  node 
in  the  graphical  model.  For  example,  the  message  passing  behavior  of  Loopy  Belief  Propagation 
[Pearl,  1988]  can  be  described  by  this  iterative  decoding  approach  [Ross  et  al.,  2011]. 
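
As an illustration of the iterative decoding policy in Equation (3.2), the following Python sketch applies a predictor simultaneously to every element of a chain for a fixed number of rounds. The chain structure, the averaging of neighboring predictions, and the hand-written `toy_phi` are assumptions made for this example only; in practice the predictor would be learned.

```python
import numpy as np

def neighbors(j, n):
    """Indices of the chain neighbors N(j) of element j (assumed structure)."""
    return [i for i in (j - 1, j + 1) if 0 <= i < n]

def iterative_decode(x, phi, num_classes, num_rounds=3):
    """Simultaneously update every element for a fixed number of rounds.

    x   : array of shape (n, d), one feature vector x_j per element
    phi : callable (x_j, neighbor_context) -> score vector of length num_classes
    """
    n = len(x)
    y_hat = np.full((n, num_classes), 1.0 / num_classes)  # uniform initial predictions
    for _ in range(num_rounds):
        new_y = np.empty_like(y_hat)
        for j in range(n):
            # Summarize the neighbors' current predictions (the "messages").
            context = y_hat[neighbors(j, n)].mean(axis=0)
            new_y[j] = phi(x[j], context)
        y_hat = new_y  # replacement policy: new predictions overwrite old ones
    return y_hat

def toy_phi(x_j, context):
    """Hand-written stand-in for a learned predictor: mix local evidence with context."""
    scores = 0.7 * x_j + 0.3 * context
    return scores / scores.sum()

x = np.array([[0.9, 0.1], [0.5, 0.5], [0.8, 0.2]])
y_hat = iterative_decode(x, toy_phi, num_classes=2)
```

In this toy run the middle element has uninformative local features, so its prediction is pulled toward the first class only through the predictions of its two neighbors.
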

3.1.2 Weak Structured Predictors


We  now  adapt  the  functional  gradient  methods  discussed  in  Chapter  2  to  the  structured  prediction 
setting  by  detailing  an  additive  structured  predictor.  To  accomplish  this,  we  will  adapt  the  policy- 
based  iterative  decoding  approach  discussed  in  Section  3.1.1  to  use  an  additive  policy  instead  of 
one  which  replaces  previous  predictions. 

In  the  iterative  decoding  described  previously,  recall  that  we  have  two  components,  one  for 
selecting  which  elements  to  update,  and  another  for  updating  the  predictions  of  the  given  elements. 
Let $S^t$ be the set of components selected for updating at iteration $t$. For current predictions $\hat{y}^t$ we
can re-write the policy for computing the predictions at the next iteration of the iterative decoding
procedure as:

$$\hat{y}^{t+1}_j = \begin{cases} \phi(x_j, \hat{y}^t_{N(j)}) & \text{if } j \in S^t \\ \hat{y}^t_j & \text{otherwise.} \end{cases} \tag{3.3}$$

The additive version of this policy instead uses weak predictors $h$, each of which maps both
the input data and previous structured output to a more refined structured output, $h : \mathcal{X} \times \mathcal{Y} \to \mathcal{Y}$:

$$\hat{y}^{t+1} = \hat{y}^t + h(x, \hat{y}^t). \tag{3.4}$$


Note that, unlike in Chapter 2, we are now augmenting each weak learner $h$ to also take as input
the current prediction, $\hat{y}$.

We can build a weak predictor $h$ which performs the same actions as the previous replacement
policy by considering weak predictors with two parts: a function $h_S$ which selects which structural
elements to update, and a predictor $h_P$ which runs on the selected elements and updates the
respective pieces of the structured output.

The selection function $h_S$ takes in an input $x$ and previous prediction $\hat{y}$ and outputs a set of
structural nodes $S = \{j_1, j_2, \ldots\}$ to update. For each structural element selected by $h_S$, the predictor
$h_P$ takes the place of $\phi$ in the previous policy, taking $(x_j, \hat{y}_{N(j)})$ and computing an update for
the prediction $\hat{y}_j$.

Returning to the part-of-speech tagging example, possible selection functions would select different
chunks of the sentence, either individual words or multi-word phrases using some selection
criteria.  Given  the  set  of  selected  elements,  a  prediction  function  would  take  each  selected  word 
or  phrase  and  update  the  predicted  distribution  over  the  part-of-speech  labels  using  the  features  for 
that  word  or  phrase. 

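Using these elements we can write the weak predictor $h$, which produces a structured output
$(h(\cdot)_1, \ldots, h(\cdot)_J)$, as

$$h(x, \hat{y})_j = \begin{cases} h_P(x_j, \hat{y}_{N(j)}) & \text{if } j \in h_S(x, \hat{y}) \\ 0 & \text{otherwise,} \end{cases} \tag{3.5}$$

or alternatively we can write this using an indicator function:

$$h(x, \hat{y})_j = \mathbb{1}(j \in h_S(x, \hat{y})) \, h_P(x_j, \hat{y}_{N(j)}). \tag{3.6}$$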
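
The construction in Equations (3.4) and (3.6) can be sketched in Python as follows. The selector, predictor, and chain neighborhood used here (`select_even`, `nudge_toward_features`, `chain_neighbors`) are hypothetical toy choices introduced only for illustration.

```python
import numpy as np

def make_weak_predictor(h_S, h_P, neighbors):
    """Combine a selector h_S and a predictor h_P into h as in Equation (3.6)."""
    def h(x, y_hat):
        update = np.zeros_like(y_hat)  # unselected elements receive a zero update
        for j in h_S(x, y_hat):
            update[j] = h_P(x[j], y_hat[neighbors(j)])
        return update
    return h

# Hypothetical toy components, for illustration only.
def select_even(x, y_hat):
    """Select every second structural element."""
    return [j for j in range(len(x)) if j % 2 == 0]

def nudge_toward_features(x_j, y_neighbors):
    """Move halfway from the mean neighbor prediction toward the local features."""
    return 0.5 * (x_j - y_neighbors.mean(axis=0))

def chain_neighbors(j, n=3):
    return [i for i in (j - 1, j + 1) if 0 <= i < n]

x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
y_t = np.zeros((3, 2))
h = make_weak_predictor(select_even, nudge_toward_features, chain_neighbors)
y_next = y_t + h(x, y_t)  # additive update of Equation (3.4)
```

Only the selected elements (here, indices 0 and 2) change; the prediction for element 1 is carried over unchanged because its update is zero.
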

3.1.3 Functional Gradient Projection

In order to use the functional gradient framework discussed in the previous chapter, we need to be
able to complete the projection operation over the $\mathcal{H}$ given in Equations (2.11-2.12). We assume
that we are given a fixed set of possible selection functions, $\mathcal{H}_S$, and a set of $L$ learning algorithms,
where each $\mathcal{A} : \mathcal{D} \to \mathcal{H}_P$ generates a predictor given a training set $D$.

In  practice,  these  algorithms  may  be  different  methods  for  generating  regressors,  classifiers, 
or  other  weak  learners  tailored  to  the  specific  problem.  The  reason  for  this  distinction  between 
the  selection  and  prediction  functions  is  that,  in  practice,  the  selection  functions  are  often  problem 
specific  and  cannot  be  trained  to  target  a  given  gradient,  like  weak  learners  often  are.  Instead, 
we  will  use  an  enumeration  strategy  to  find  the  best  selection  function,  and  train  the  prediction 
functions  to  target  a  given  functional  gradient. 

Consider the loss function given in Equation (3.1). This function is actually a function of the
vector of outputs $(\hat{y}_1, \ldots)$. Recall from the previous chapter and Equation (2.7) that the gradient $\nabla$
at each input $x$ will simply be the gradient of the loss $\ell$ at the current output,

$$\nabla(x) = \nabla_{f(x)} \ell(f(x)).$$

In the structured prediction setting, we are actually concerned with each individual component
of the gradient $\nabla(x)_j$, corresponding to output $\hat{y}_j$. This gradient component is simply

$$\nabla(x)_j = \frac{\partial \ell(f(x))}{\partial f(x)_j}, \tag{3.7}$$

or, the gradient of the loss with respect to the partial structured prediction $\hat{y}_j = f(x)_j$.

Given a fixed selection function $h_S$ and current predictions $\hat{y}$, we can build a dataset appropriate
for training weak predictors $h_P$ as follows. In order to minimize the projection error in
Equation (2.12) for a predictor $h$ of the form in Equation (3.6), we only need to find the prediction
function $h_P$ that minimizes

$$h_P^* = \arg\min_{h_P \in \mathcal{H}_P} \mathbb{E}_x \left[ \sum_{j \in h_S(x, \hat{y})} \left\| \nabla(x)_j - h_P(x_j, \hat{y}_{N(j)}) \right\|^2 \right]. \tag{3.8}$$


This optimization problem is equivalent to minimizing weighted least squares error over the
dataset

$$D = \bigcup_{x} \bigcup_{j \in h_S(x, \hat{y})} \{(\psi_j, \nabla(x)_j)\} = \text{gradient}(f, h_S), \tag{3.9}$$

where $\psi_j = \psi(x_j, \hat{y}_{N(j)})$ is a feature descriptor for the given structural node, and $\nabla(x)_j$ is its
target. In order to model contextual information, $\psi_j$ is drawn from both the raw features $x_j$ for the
given element and the previous locally neighboring predictions $\hat{y}_{N(j)}$.
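
A minimal Python sketch of the dataset construction in Equation (3.9) is given below, followed by one possible learning algorithm (unweighted linear least squares). The squared loss, the chain neighborhood inside `psi`, and the helper names are assumptions for illustration and are not prescribed by the text.

```python
import numpy as np

def build_gradient_dataset(xs, y_hats, grads, h_S, psi):
    """Collect (psi_j, gradient_j) pairs over all inputs and all selected elements."""
    features, targets = [], []
    for x, y_hat, grad in zip(xs, y_hats, grads):
        for j in h_S(x, y_hat):
            features.append(psi(x, y_hat, j))
            targets.append(grad[j])
    return np.array(features), np.array(targets)

def fit_least_squares(features, targets):
    """One possible learning algorithm: a linear least-squares fit to the gradient targets."""
    weights, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return lambda feat: feat @ weights

# Hypothetical selector and feature descriptor on a chain structure.
def select_all(x, y_hat):
    return range(len(x))

def psi(x, y_hat, j):
    """Concatenate the raw features x_j with the mean neighboring prediction."""
    nbrs = [i for i in (j - 1, j + 1) if 0 <= i < len(x)]
    return np.concatenate([x[j], y_hat[nbrs].mean(axis=0)])

xs = [np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])]
y_hats = [np.zeros((3, 2))]
labels = [np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])]
# Gradient of the assumed loss 0.5 * ||y_hat - y||^2 with respect to y_hat.
grads = [y_hat - y for y_hat, y in zip(y_hats, labels)]

features, targets = build_gradient_dataset(xs, y_hats, grads, select_all, psi)
h_P = fit_least_squares(features, targets)
```

Each row of `features` is one descriptor $\psi_j$ and each row of `targets` is the matching gradient component, so any regression method can stand in for `fit_least_squares`.
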

Now, we can use this modified weak learner and projection operation in the functional gradient
descent framework from Chapter 2. In order to complete the gradient projection operation over all
weak learners in the set defined by Equation (3.6), we simply enumerate all selection strategies $h_S$.
Then, for each selection strategy and each learning algorithm in $\{\mathcal{A}_l\}_{l=1}^L$, we can use Equation (3.8),
via the dataset in Equation (3.9), to generate a candidate weak prediction function $h_P$ for that
selector, algorithm pair. The pair $h_S, h_P$ can be used to define a structured weak learner $h$ as in
Equation (3.6), and the best overall weak learner can be selected as the projected gradient.


Algorithm  3.1  Structured  Functional  Gradient  Descent 

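Given: objective $\mathcal{R}$, set of selection functions $\mathcal{H}_S$, set of $L$ learning algorithms $\{\mathcal{A}_l\}_{l=1}^L$, number
of iterations $T$, initial function $f_0$.
for $t = 1, \ldots, T$ do
    $\mathcal{H}^* = \emptyset$
    for $h_S \in \mathcal{H}_S$ do
        Create dataset $D = \text{gradient}(f_{t-1}, h_S)$ using Equation (3.9).
        for $\mathcal{A} \in \{\mathcal{A}_1, \ldots, \mathcal{A}_L\}$ do
            Train $h_P = \mathcal{A}(D)$
            Define $h$ from $h_S$ and $h_P$ using Equation (3.6).
            $\mathcal{H}^* = \mathcal{H}^* \cup \{h\}$
        end for
    end for
    Let $\nabla(x) = \nabla_{f(x)} \ell(f(x))$ for all $x$.
    $h_t = \arg\min_{h \in \mathcal{H}^*} \mathbb{E}_x[\|\nabla(x) - h(x)\|^2]$
    Select a step size $\alpha_t$.
    $f_t = f_{t-1} + \alpha_t h_t$
end for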


Algorithm 3.1 summarizes the structured version of functional gradient descent. It enumerates
the candidate selection functions $h_S$, creates the training dataset defined by Equation (3.9), and
then generates a candidate prediction function $h_P$ using each weak learning algorithm.

We will make use of this algorithm in Chapter 7, when we examine an anytime structured prediction
approach. That chapter also gives more details on applications of this structured functional
gradient approach and on practical implementation concerns.


3.2  Stacked  Boosting 

When training models that incorporate previous predictions, such as the structured prediction
approach discussed in Section 3.1, the risk of overfitting is typically a major concern. In this section,
we examine the use of stacking, a method for training multiple simultaneous predictors in order to
simulate the overfitting in early predictions, and show how to use this approach to reduce overfitting
in functional gradient methods which re-use previous predictions.

Originally from a different domain, the concept of stacking [Wolpert, 1992, Cohen and Carvalho,
2005] is an approach for reducing the overfitting in a model due to the re-use of previous
predictions. Essentially this method trains multiple copies of a given predictor while holding out
portions of the dataset, in a manner similar to cross-validation. Each predictor is then run on the
held-out data to generate “unbiased” predictions for use as inputs to later predictors, mitigating the
impact of overfitting on those predictions. This approach has proven to be useful in structured prediction
settings [Cohen and Carvalho, 2005, Munoz et al., 2010] such as computer vision, where
it is common to build sequential predictors which use neighboring and previous predictions as
contextual information to improve overall performance.

It  is  this  stacking  approach  which  we  will  now  examine  and  extend  to  the  functional  gradient 
setting. 


3.2.1  Background 

The stacking method is originally a method for training feed-forward networks, or sequences of
predictors which re-use previous predictions as inputs at each point in the sequence. In stacked forward
training, a set of predictors is trained using a sequential approach that trains each successive
predictor iteratively using the outputs from previously trained ones. This approach is common in
structured prediction tasks such as vision, where iterated predictions are used to allow for smoothing
of neighboring regions, or when the structure of lower level features is selected a priori and trained
independently of later, more complex feature representations.

Assume we are given a dataset $\mathcal{D}^0$ of examples and labels $\{(x_n, y_n)\}_{n=0}^N$. We model the feed-forward
ensemble of $K$ learners as a sequence of predictors $f^1, \ldots, f^K$, with the output of predictor
$k$ given as

$$x_n^k = f^k(x_n^{k-1}),$$

with the initial input $x_n^0 = x_n$.

Assume that for each iteration $k$, there is a learning algorithm $\mathcal{A}^k(D)$ for generating the predictor
$f^k$ for that iteration, using predictions from the previous iterations $x_n^{k-1}$ and labels $y_n$. That
is, having trained the previous functions $f^1, \ldots, f^{k-1}$, the next predictor is trained by building a
dataset

$$S^k = \{(x_n^{k-1}, y_n)\}_{n=0}^N,$$

and then training the current layer

$$f^k = \mathcal{A}^k(S^k). \tag{3.10}$$

This  method  is  not  robust  to  overfitting,  however,  as  errors  in  early  predictors  are  re-used  for 
training  later  predictors,  while  unseen  test  data  will  likely  generate  less  accurate  predictions  or  low 
level  features.  If  early  predictors  in  the  sequence  overfit  to  the  training  set,  later  predictors  will  be 
trained to rely on these overfit inputs, potentially overfitting further to the training set and leading
to  a  cascade  of  failures. 

The stacking method [Cohen and Carvalho, 2005] is a method for reducing the overall generalization
error of a sequence of trained predictors, by attempting to generate an unbiased set of
previous predictions for use in training each successive predictor. This is done by training multiple
copies of each predictor on different portions of the data, in a manner similar to cross-validation,
and using these copies to predict on unseen parts of the data set.

More formally, we split the dataset $S^k$ into $J$ equal portions $S^k_1, \ldots, S^k_J$ and for each predictor
$f^k$ train an additional $J$ copies $f^k_j$. Each copy is trained on the dataset excluding the corresponding
fold, as in $J$-fold cross-validation:

$$f^k_j = \mathcal{A}^k(S^k \setminus S^k_j). \tag{3.11}$$

Each of the copies is then used to generate predictions on the held-out portion of the data which
are used to continue the training by building a dataset of the held-out predictions:

$$S^{k+1} = \bigcup_{j=1}^{J} \left\{ (f^k_j(x), y) \mid (x, y) \in S^k_j \right\}. \tag{3.12}$$

The predictor $f^k$ for the final sequence is still trained on the whole dataset $S^k$, as in Equation (3.10),
and returned in the final model. The stacked copies are only used to generate the predictions for
training the rest of the sequence, and are then discarded.

A  complete  description  of  stacked  forward  training  is  given  in  Algorithm  3.2. 


Algorithm  3.2  Stacked  Forward  Training 

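Given: initial dataset $S^0$, training algorithms $\mathcal{A}^k$, number of stacking folds $J$.
for $k = 1, \ldots, K$ do
    Let $f^k = \mathcal{A}^k(S^k)$.
    Split $S^k$ into equal parts $S^k_1, \ldots, S^k_J$.
    For $j = 1, \ldots, J$ let $f^k_j = \mathcal{A}^k(S^k \setminus S^k_j)$.
    Let $S^{k+1} = \bigcup_{j=1}^{J} \{(f^k_j(x), y) \mid (x, y) \in S^k_j\}$.
end for
return $(f^1, \ldots, f^K)$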
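
The stacked forward training procedure above can be sketched in Python as follows. The least-squares learner `train_linear`, the contiguous fold split, and the synthetic data are assumptions made for this example; any learning algorithm with the same interface could be substituted.

```python
import numpy as np

def stacked_forward_training(X, y, train_fns, num_folds):
    """Train a sequence of predictors, feeding each layer held-out predictions.

    train_fns : list of callables (inputs, labels) -> predictor, one per layer
    Returns the list of predictors trained on all data at each layer.
    """
    n = len(X)
    folds = np.array_split(np.arange(n), num_folds)
    inputs = np.asarray(X, dtype=float)
    model = []
    for train in train_fns:
        # The predictor kept in the final model is trained on the whole dataset.
        model.append(train(inputs, y))
        # Stacked copies: each is trained without one fold, then predicts on that fold.
        next_inputs = np.empty((n, 1))
        for fold in folds:
            rest = np.setdiff1d(np.arange(n), fold)
            copy = train(inputs[rest], y[rest])
            next_inputs[fold] = copy(inputs[fold])
        inputs = next_inputs  # held-out predictions become the next layer's inputs
    return model

def train_linear(inputs, labels):
    """Hypothetical learning algorithm: least-squares fit with a bias term."""
    A = np.hstack([inputs, np.ones((len(inputs), 1))])
    w, *_ = np.linalg.lstsq(A, labels, rcond=None)
    return lambda Z: (np.hstack([Z, np.ones((len(Z), 1))]) @ w).reshape(-1, 1)

def predict(model, X):
    """Run the returned sequence of predictors in order."""
    out = np.asarray(X, dtype=float)
    for f in model:
        out = f(out)
    return out.ravel()

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = X @ np.array([1.0, -2.0]) + 0.5
model = stacked_forward_training(X, y, [train_linear, train_linear], num_folds=4)
```

Note that the fold copies are discarded after producing `next_inputs`; only the predictors trained on all of the data are returned.
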


3.2.2  Stacked  Functional  Gradient  Methods 

Now  we  want  to  adapt  this  stacking  method  to  the  domain  of  functional  gradient  methods.  One 
key  difference  between  the  stacking  approach  used  in  Section  3.2.1  and  the  functional  gradient 
approach is that stacking was originally developed for networks of predictors which use only previous
predictions as inputs at each layer, while in the functional gradient method, we still want to
retain  the  example  x  as  an  input  in  addition  to  previous  predictions. 



Algorithm  3.3  Stacked  Functional  Gradient  Descent 

Given: starting point $f_0$, step size schedule $\{\eta_t\}_{t=1}^T$, number of stacking folds $K$.
Split training data $X$ into equal parts $X_1, \ldots, X_K$.
Let $f_{k,0} = f_0$.
for $t = 1, \ldots, T$ do
    for $k = 1, \ldots, K$ do
        Let $X_k^c = X \setminus X_k$.
        Let $\hat{y}_k^c = \{f_{k,t-1}(x) \mid x \in X_k^c\}$.
        Let $\hat{y}_k = \{f_{k,t-1}(x) \mid x \in X_k\}$.
        Compute a subgradient $\nabla_{k,t} \in \partial \mathcal{R}[f_{k,t-1}]$ over only points in $X_k^c$.
        Compute $h_k^* = \text{Proj}(\nabla_{k,t}, \mathcal{H})$, again only over points in $X_k^c$ and using $\hat{y}_k^c$ as previous
        predictions.
        Update $f_k$: $f_{k,t} = f_{k,t-1} - \eta_t h_k^*$.
    end for
    Let $\hat{y} = \bigcup_k \hat{y}_k$.
    Compute a subgradient $\nabla_t \in \partial \mathcal{R}[f_{t-1}]$ over all points in $X$.
    Compute $h^* = \text{Proj}(\nabla_t, \mathcal{H})$ over all points in $X$ and using $\hat{y}$ as previous predictions.
    Update $f$: $f_t = f_{t-1} - \eta_t h^*$.
end for


Consider the following weak learner $h$ which takes both example and previous predictions as
inputs:

$$h(x, \hat{y}) : \mathcal{X} \times \mathcal{Y} \to \mathcal{Y}.$$

One example is the structured weak learner discussed in Section 3.1, and given in Equation (3.6).
We can define the final output of a boosted ensemble of such weak learners as

$$f(x) = \sum_{t} h_t(x, \hat{y}_t),$$

where $\hat{y}_t$ is given as the prediction up to weak learner $t$:

$$\hat{y}_t = \sum_{i=1}^{t-1} h_i(x, \hat{y}_i).$$

We want to use the stacking method to generate held-out versions of the predictions for use as
input when computing new weak learners. To do this, we will follow the same general procedure
as outlined above for feed-forward stacking.

We will maintain the real boosted predictor $f$, along with copies $f_k$ for each of $K$ folds of
the training data. At training time, each copy $f_k$ is trained on all data except fold $k$, and using
its own predictions as previous predictions. The true predictor $f$ is trained using all the data, but
the previous predictions are drawn from the outputs of each copy $f_k$, run on its respective fold of
the data $X_k$ which it was not previously trained on. At test time, only the predictor $f$ which was
trained on all data will be used for prediction.
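
The training procedure just described can be sketched in Python under several simplifying assumptions: a squared loss, scalar outputs, a fixed step size, and linear least-squares weak learners that take the example together with the running prediction as input. The helper names `project`, `run`, and `stacked_fgd` are introduced here for illustration and do not come from the text.

```python
import numpy as np

def project(features, gradient):
    """Least-squares fit of a linear weak learner to the gradient (illustrative choice)."""
    A = np.hstack([features, np.ones((len(features), 1))])
    w, *_ = np.linalg.lstsq(A, gradient, rcond=None)
    return lambda F: np.hstack([F, np.ones((len(F), 1))]) @ w

def run(learners, X, step):
    """Evaluate a boosted predictor, feeding each weak learner the running prediction."""
    pred = np.zeros(len(X))
    for h in learners:
        pred = pred - step * h(np.column_stack([X, pred]))
    return pred

def stacked_fgd(X, y, num_folds=3, num_iters=5, step=0.5):
    n = len(X)
    folds = np.array_split(np.arange(n), num_folds)
    copies = [[] for _ in folds]  # weak learners of each fold's copy f_k
    main = []                     # weak learners of the real predictor f
    main_pred = np.zeros(n)       # output of f on the training points so far
    for _ in range(num_iters):
        held_out = np.empty(n)
        for k, fold in enumerate(folds):
            rest = np.setdiff1d(np.arange(n), fold)
            # Held-out predictions: copy k has never been trained on its own fold.
            held_out[fold] = run(copies[k], X[fold], step)
            # Copy k trains on the remaining data, using its own previous predictions.
            own = run(copies[k], X[rest], step)
            grad = own - y[rest]  # gradient of the squared loss 0.5 * (f(x) - y)^2
            copies[k].append(project(np.column_stack([X[rest], own]), grad))
        # The real predictor trains on all data with held-out predictions as context.
        context = np.column_stack([X, held_out])
        h_star = project(context, main_pred - y)
        main_pred = main_pred - step * h_star(context)
        main.append(h_star)
    return main

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = X @ np.array([2.0, -1.0]) + 1.0
main = stacked_fgd(X, y)
test_pred = run(main, X, step=0.5)  # at test time only f is used
```

At test time the fold copies are not needed: `run(main, ...)` feeds the real predictor its own running predictions, which is the situation the held-out predictions are meant to imitate during training.
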

Algorithm  3.3  gives  the  stacked  version  of  functional  gradient  descent.  This  method  can  be 
combined  with  other  functional  gradient  methods  fairly  easily,  such  as  the  structured  functional 
gradient approach detailed earlier, by simply following the same strategy of maintaining $K$ copies
of the boosted predictor and using each copy to compute held-out predictions. Later, in Chapter 7, we
will combine both of these approaches to build a stacked, structured functional gradient learner.

Part II

Greedy Optimization

Chapter 4

Budgeted  Submodular  Function 
Maximization 


In  this  chapter  we  analyze  the  performance  of  greedy  and  approximately  greedy  algorithms  for 
budgeted,  approximate  submodular  maximization.  We  will  be  using  a  version  of  approximate 
submodularity which includes both a multiplicative and additive relaxation from the standard definition
of submodularity. For this setting, we show that greedy approaches achieve approximation
bounds with respect to a subset of all arbitrary budgets corresponding to the costs of each successive
subsequence selected. Finally, we show that, if an approximation bound is desired for any
arbitrary  budget,  a  modification  of  the  greedy  algorithm  can  achieve  a  bi-criteria  approximation  in 
both  value  and  time  for  arbitrary  budgets. 

4.1  Background 

In this chapter we will be analyzing approaches for maximizing positive set functions $F : 2^X \to \mathbb{R}$,
$F(S) \geq 0$ over elements $X$, where $2^X$ is the power set of $X$. We will be building off of a large body
of work focusing on submodular functions. A function $F$ is submodular if, for all $A \subseteq B \subseteq X$ and $x \in X$,

$$F(\{x\} \cup A) - F(A) \geq F(\{x\} \cup B) - F(B).$$

An equivalent definition, which we will build off of later, relates the gain in the value of $F$ when
adding a whole set to the gain when adding each element individually. A function $F$ is submodular
if, for all $S, A \subseteq X$,

$$F(S \cup A) - F(A) \leq \sum_{x \in S} \left[ F(A \cup \{x\}) - F(A) \right]. \tag{4.1}$$

We will further restrict our analysis to monotone submodular functions, that is, functions $F$
such that $F(A) \leq F(B)$ if $A \subseteq B$.

In the budgeted setting, every element in $X$ is associated with a positive cost $c : X \to \mathbb{R}$,
$c(x) > 0$. The cost of a set of elements $S$ is the modular function $c(S) = \sum_{x \in S} c(x)$.
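
For small ground sets, the definitions above can be checked by brute force. The following Python sketch does so for a toy coverage function; the `covers` and `cost` tables are hypothetical data chosen for illustration.

```python
from itertools import combinations

def powerset(items):
    """All subsets of items, as frozensets."""
    items = list(items)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]

def is_submodular(F, ground):
    """Check F({x} | A) - F(A) >= F({x} | B) - F(B) for all A within B and all x."""
    for B in powerset(ground):
        for A in powerset(B):
            for x in ground:
                if F(A | {x}) - F(A) < F(B | {x}) - F(B) - 1e-12:
                    return False
    return True

def is_monotone(F, ground):
    """Check F(A) <= F(B) whenever A is a subset of B."""
    return all(F(A) <= F(B) + 1e-12 for B in powerset(ground) for A in powerset(B))

# Hypothetical example: each element covers some items and F counts the covered items.
covers = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}}
cost = {"a": 1.0, "b": 1.0, "c": 2.5}

def coverage(S):
    return len(set().union(*(covers[x] for x in S)))

def total_cost(S):
    """Modular cost c(S): the sum of the individual element costs."""
    return sum(cost[x] for x in S)

def squared_size(S):
    """A monotone function that is not submodular, for contrast."""
    return len(S) ** 2
```

The check enumerates every pair of nested subsets, so it is only practical for very small ground sets, but it makes the diminishing-returns condition concrete.
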

The budgeted monotone submodular maximization problem is then to maximize a set function
$F$ subject to a constraint $B$ on the total cost of the set:

$$\arg\max_{S} F(S) \quad \text{subject to} \quad c(S) \leq B. \tag{4.2}$$

When performing our analysis, we will be working both with sequences of elements, e.g. the
sequence of selections made by a given algorithm, and sets of elements corresponding to particular
points in said sequences. Given a sequence $S = (s_1, \ldots)$ and a given budget $C$, we can define the
resulting set at that budget to be $S_{\langle C \rangle} = \{s_1, \ldots, s_k\}$, where $k$ is the largest index such that
$\sum_{j=1}^{k} c(s_j) \leq C$. Similarly, define the set $S_k$ to be $\{s_1, \ldots, s_k\}$, and $S_0 = \emptyset$.
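
The set selected from a sequence at a given budget is simply the longest affordable prefix. A minimal Python sketch follows, in which the function name and the example costs are assumptions for illustration.

```python
def prefix_within_budget(sequence, cost, budget):
    """Return the longest prefix s_1, ..., s_k whose total cost is at most the budget."""
    selected, spent = [], 0.0
    for s in sequence:
        if spent + cost[s] > budget:
            break  # stop at the first element that does not fit; later ones are not considered
        selected.append(s)
        spent += cost[s]
    return selected

example_cost = {"a": 1.0, "b": 2.0, "c": 1.0}
example_sequence = ["a", "b", "c"]
```

Because the result is always a prefix, increasing the budget can only append elements, which is the property required of a budget-agnostic sequence. For instance, at budget 2 only `"a"` is selected, even though `"c"` would also fit.
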

As discussed in the related work in Section 1.3, previous work [Khuller et al.,
1999, Krause and Guestrin, 2005, Leskovec et al., 2007, Lin and Bilmes, 2010] has included a
number of algorithms and corresponding approximation bounds for this setting. These approaches
range from variations on the cost-greedy algorithm to much more complex strategies, and have approximation
bounds with factors of $\frac{1}{2}(1 - \frac{1}{e})$ and $(1 - \frac{1}{e})$, extending the original result of Nemhauser
et al. [1978] for the unit-cost, or cardinality constrained case.

The key difference between our analysis here and these previous results is that these results all
require that the budget be known a priori. For example, one of the results of Krause and Guestrin
[2005] which achieves a $\frac{1}{2}(1 - \frac{1}{e})$ approximation uses a modified greedy algorithm which selects
either the result of the cost-greedy algorithm, or the single largest element with cost less than the
budget.

Unfortunately,  for  our  purposes  we  want  a  single,  budget-agnostic  algorithm  which  produces 
a  sequence  of  elements  with  good  performance  at  any  budget.  Approaches  such  as  the  previous 
example  both  target  a  fixed  budget  and  do  not  produce  a  single  sequence  for  all  budgets.  If  the 
budget  is  increased,  the  selected  set  may  change  completely,  whereas  we  want  a  method  such  that 
increasing  the  budget  only  adds  elements  to  the  currently  selected  set. 

As we will discuss in Section 4.4, in general it is impossible to have a budget-agnostic algorithm
which achieves approximation bounds for all budgets, but a small tweak to the cost-greedy
algorithm does produce a sequence which achieves a bi-criteria approximation, which approximates
the optimal set in both value and in cost.


4.2  Approximate  Submodularity 

Unlike the submodular functions which we discussed in Section 4.1, we want to analyze the
performance of greedy algorithms on functions which behave like submodular functions to some
degree, but are not strictly submodular. Building off of the requirement given in Equation (4.1),
Das and Kempe [2011] give a definition of approximate submodularity which uses a multiplicative
ratio $\gamma \in [0, 1]$ which they call the submodularity ratio. In this work, we extend this definition of
approximate submodularity to also allow for an additive error term $\delta$, similar to the approximate
submodularity definition utilized by Krause and Cevher [2010].

Definition 4.2.1 (Approximate Submodularity). A function $F$ is $(\gamma, \delta)$-approximately submodular
if:

$$\gamma \left[ F(A \cup S) - F(A) \right] - \delta \leq \sum_{x \in S} \left[ F(A \cup \{x\}) - F(A) \right],$$

for all $S, A \subseteq X$.

As expected, this notion of approximate submodularity also extends the traditional definition of
submodularity given in Equation (4.1), with any submodular function being $(1, 0)$-approximately
submodular. Further, for $\delta = 0$ this definition reduces to the one given by Das and Kempe [2011].
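
As a small worked example of Definition 4.2.1, the Python sketch below checks the $(\gamma, \delta)$ condition by brute force for the hypothetical function $F(S) = |S|^2$ on a three-element ground set. This function is monotone but not submodular, yet it satisfies the relaxed condition with either a multiplicative or an additive slack.

```python
from itertools import combinations

def powerset(items):
    """All subsets of items, as frozensets."""
    items = list(items)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]

def is_approx_submodular(F, ground, gamma, delta):
    """Check gamma * [F(A | S) - F(A)] - delta <= sum of single-element gains, for all S, A."""
    subsets = powerset(ground)
    for A in subsets:
        for S in subsets:
            joint_gain = F(A | S) - F(A)
            sum_of_gains = sum(F(A | {x}) - F(A) for x in S)
            if gamma * joint_gain - delta > sum_of_gains + 1e-12:
                return False
    return True

def squared_size(S):
    """Hypothetical test function F(S) = |S|^2."""
    return len(S) ** 2

ground = {1, 2, 3}
```

For this function the tightest case is $A = \emptyset$ and $S = X$, where the joint gain is 9 but the individual gains sum to 3, so $\gamma = 1/3$ with $\delta = 0$ suffices, as does $\gamma = 1$ with $\delta = 6$.
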

4.2.1  Greedy  Algorithm  Analysis 


Algorithm  4.1  Greedy  Algorithm 

Given: objective function $F$, elements $X$
Let $G_0 = \emptyset$.
for $j = 1, \ldots$ do
    Let $g_j = \arg\max_{x \in X} \frac{F(G_{j-1} \cup \{x\}) - F(G_{j-1})}{c(x)}$.
    Let $G_j = G_{j-1} \cup \{g_j\}$.
end for


We will now analyze the standard greedy algorithm (given in Algorithm 4.1) for the budgeted
submodular maximization problem, operating on a set function $F$ that is approximately submodular
according to Definition 4.2.1.

As shown in Algorithm 4.1, the cost-greedy algorithm iteratively selects a sequence $G = (g_1, \ldots)$ using:

$$g_j = \arg\max_{x \in X} \frac{F(G_{j-1} \cup \{x\}) - F(G_{j-1})}{c(x)}. \tag{4.3}$$

We now present a bound that shows that the cost-greedy algorithm is nearly optimal for approximately
submodular functions. The analysis follows the cost-based greedy analysis of
Streeter and Golovin [2008], generalized to handle the approximately submodular case as in Das and
Kempe [2011]. Similar to Krause and Golovin [2012], we also handle the case where the greedy
list and optimal list are selected using different budgets.

First, we need to adapt the approximate submodularity definition given in Definition 4.2.1 to a
bound that also relates the costs of the elements and combined set.
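
Lemma 4.2.2. If a function $F$ is $(\gamma, \delta)$-approximately submodular then for any $A, S \subseteq X$:

$$\frac{\gamma \left[ F(A \cup S) - F(A) \right] - \delta}{c(S)} \leq \max_{x \in S} \frac{F(A \cup \{x\}) - F(A)}{c(x)}.$$

Proof. By Definition 4.2.1, we have:

$$\gamma \left[ F(A \cup S) - F(A) \right] - \delta \leq \sum_{x \in S} \left[ F(A \cup \{x\}) - F(A) \right]$$
$$\leq \sum_{x \in S} \left( \max_{x' \in S} \frac{F(A \cup \{x'\}) - F(A)}{c(x')} \right) c(x)$$
$$\leq \left( \max_{x \in S} \frac{F(A \cup \{x\}) - F(A)}{c(x)} \right) \sum_{x \in S} c(x).$$

Dividing both sides by $c(S) = \sum_{x \in S} c(x)$ completes the proof. $\square$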


Now,  we  can  use  this  result  to  bound  the  gap  between  the  optimal  set  and  the  set  selected  by 
the  greedy  algorithm  at  each  iteration. 

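Lemma 4.2.3. Let $s_j$ be the value of the maximum in Equation (4.3) evaluated by the greedy
algorithm at iteration $j$. Then for all sequences $S$ and total costs $C$:

$$F(S_{\langle C \rangle}) \leq F(G_{j-1}) + \frac{C s_j + \delta}{\gamma}.$$

Proof. By Lemma 4.2.2 we have:

$$\frac{\gamma \left[ F(G_{j-1} \cup S_{\langle C \rangle}) - F(G_{j-1}) \right] - \delta}{c(S_{\langle C \rangle})} \leq \max_{x \in S_{\langle C \rangle}} \frac{F(G_{j-1} \cup \{x\}) - F(G_{j-1})}{c(x)} \leq s_j.$$

By monotonicity we have $F(S_{\langle C \rangle}) \leq F(G_{j-1} \cup S_{\langle C \rangle})$, and by definition $c(S_{\langle C \rangle}) \leq C$, giving:

$$F(S_{\langle C \rangle}) \leq F(G_{j-1}) + \frac{C s_j + \delta}{\gamma}.$$

$\square$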

Now, using Lemma 4.2.3, we can derive the actual approximation bound for the greedy algorithm.
Like previous results [Das and Kempe, 2011], for a $(\gamma, \delta)$-approximately submodular
function, this bound includes the multiplicative term $\gamma$ in the multiplicative approximation term,
but has an additional additive term dependent on the additive term $\delta$.

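Theorem 4.2.4. Let $G = (g_1, \ldots)$ be the sequence selected by the cost-greedy algorithm. Fix
some $K > 0$. Let $B = \sum_{i=1}^{K} c(g_i)$. Let $F$ be $(\gamma, \delta)$-approximately submodular as in Definition 4.2.1.
For any sequence $S$ and total cost $C$,

$$F(G_{\langle B \rangle}) \geq \left( 1 - e^{-\gamma \frac{B}{C}} \right) \left( F(S_{\langle C \rangle}) - \frac{\delta}{\gamma} \right).$$

Or more loosely:

$$F(G_{\langle B \rangle}) \geq \left( 1 - e^{-\gamma \frac{B}{C}} \right) F(S_{\langle C \rangle}) - \delta \frac{B}{C}.$$

Proof. Define $\Delta_j = F(S_{\langle C \rangle}) - F(G_j) - \frac{\delta}{\gamma}$. By Lemma 4.2.3, $F(S_{\langle C \rangle}) \leq F(G_j) + \frac{C s_{j+1}}{\gamma} + \frac{\delta}{\gamma}$.
By definition of $s_{j+1}$:

$$\Delta_j \leq \frac{C s_{j+1}}{\gamma} = \frac{C}{\gamma} \left( \frac{\Delta_j - \Delta_{j+1}}{c(g_{j+1})} \right).$$

Rearranging we get $\Delta_{j+1} \leq \Delta_j \left( 1 - \gamma \frac{c(g_{j+1})}{C} \right)$. Unroll to get

$$\Delta_K \leq \Delta_0 \prod_{j=1}^{K} \left( 1 - \gamma \frac{c(g_j)}{C} \right).$$

Given that $B = \sum_{j=1}^{K} c(g_j)$, this is maximized at $c(g_j) = \frac{B}{K}$. Substituting in and using the
fact that $\left( 1 - \frac{z}{K} \right)^K < e^{-z}$, together with $\Delta_0 \leq F(S_{\langle C \rangle}) - \frac{\delta}{\gamma}$ (which holds since $F(G_0) \geq 0$):

$$F(S_{\langle C \rangle}) - \frac{\delta}{\gamma} - F(G_K) = \Delta_K \leq \Delta_0 \left( 1 - \gamma \frac{B}{C} \frac{1}{K} \right)^K < \left( F(S_{\langle C \rangle}) - \frac{\delta}{\gamma} \right) e^{-\gamma \frac{B}{C}},$$

or $F(G_{\langle B \rangle}) \geq \left( 1 - e^{-\gamma \frac{B}{C}} \right) \left( F(S_{\langle C \rangle}) - \frac{\delta}{\gamma} \right)$.

Since $\frac{1 - e^{-\gamma B / C}}{\gamma} \leq \frac{B}{C}$, we can also write this as

$$F(G_{\langle B \rangle}) \geq \left( 1 - e^{-\gamma \frac{B}{C}} \right) F(S_{\langle C \rangle}) - \delta \frac{B}{C}.$$

$\square$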
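
As a numerical sanity check of Theorem 4.2.4 in the exactly submodular case ($\gamma = 1$, $\delta = 0$), the following Python sketch compares each cost-greedy prefix against the brute-force optimum at several comparison budgets $C$. The coverage instance, the costs, and the choice of budgets are hypothetical; the check illustrates the bound on one instance and is not a proof.

```python
import math
from itertools import combinations

covers = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}, "d": {1, 5, 6}}
cost = {"a": 1.0, "b": 1.0, "c": 2.5, "d": 2.0}

def F(S):
    """Coverage objective: number of distinct items covered by the elements of S."""
    return len(set().union(*(covers[x] for x in S)))

def cost_greedy(ground):
    """Cost-greedy ordering of all elements, as in Equation (4.3)."""
    selected, remaining = [], sorted(ground)
    while remaining:
        base = F(selected)
        best = max(remaining, key=lambda x: (F(selected + [x]) - base) / cost[x])
        selected.append(best)
        remaining.remove(best)
    return selected

def best_value_within(budget):
    """Brute-force optimum of F over all sets with total cost at most the budget."""
    items = sorted(covers)
    return max(
        F(S)
        for r in range(len(items) + 1)
        for S in combinations(items, r)
        if sum(cost[x] for x in S) <= budget
    )

G = cost_greedy(covers)
checks = []
for K in range(1, len(G) + 1):
    B = sum(cost[g] for g in G[:K])  # budget consumed by the first K greedy picks
    for C in (1.0, 2.0, 3.5, 6.5):
        bound = (1 - math.exp(-B / C)) * best_value_within(C)
        checks.append(F(G[:K]) >= bound)
```

Every entry of `checks` corresponds to one pair of greedy budget $B$ and comparison budget $C$, including pairs with $B \neq C$, which is the case the theorem is designed to handle.
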


4.3  Approximate  Greedy  Maximization 

In  many  cases,  it  is  desirable  to  use  an  algorithm  that  does  not  actually  implement  the  greedy 
strategy,  but  instead  is  approximately  greedy.  That  is,  the  algorithm  attempts  to  select  an  element 
that  will  significantly  improve  the  value  of  the  selected  sequence,  but  does  not  always  select  the 
item  that  maximizes  the  greedy  gain  given  in  Equation  (4.3). 

For  example,  this  type  of  algorithm  is  useful  when  searching  over  the  entire  set  X  for  the 
maximum  gain  is  prohibitively  expensive,  but  a  reasonably  good  element  can  be  selected  much 
more  efficiently.  Other  examples  include  settings  where  the  submodular  function  F  is  only  able  to 
be  evaluated  at  training  time,  and  a  predictor  is  trained  using  contextual  features  to  approximate 
F  for  use  at  test  time  [Streeter  and  Golovin,  2008,  Ross  et  al.,  2013].  A  final  example  is  the 
Orthogonal  Matching  Pursuit  algorithm  which  we  will  examine  in  Chapter  5. 

We  now  give  a  specific  characterization  of  what  it  means  to  be  approximately  greedy  so  we  can 
further  analyze  these  algorithms. 

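Definition 4.3.1 (Approximately Greedy). Let $\mathcal{G}'_{j-1}$ be the set of elements selected by some
algorithm $\mathcal{A}$ through $j - 1$ iterations. Given a set function $F$, we say that $\mathcal{A}$ is approximately
greedy if for all $j$ there exist constants $\alpha_j \in [0, 1]$ and $\beta_j \ge 0$ such that $g'_j$, the element selected
by $\mathcal{A}$ at iteration $j$, satisfies:

\[ \frac{F(\mathcal{G}'_{j-1} \cup \{g'_j\}) - F(\mathcal{G}'_{j-1})}{c(g'_j)} \ge \max_{x \in X} \frac{\alpha_j \left[ F(\mathcal{G}'_{j-1} \cup \{x\}) - F(\mathcal{G}'_{j-1}) \right] - \beta_j}{c(x)}. \]

For the special case when there exist $\alpha \le \alpha_j$ and $\beta \ge \beta_j$ for all $j$ for some pseudo-greedy
algorithm $\mathcal{A}$, we say that $\mathcal{A}$ is $(\alpha, \beta)$-approximately greedy.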
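As a rough illustration of Definition 4.3.1, the sketch below measures how close a single selection is to the greedy choice. It assumes $\beta_j = 0$ and a monotone $F$, and computes the largest $\alpha_j$ for which the definition holds. The helper names (`gain_per_cost`, `implied_alpha`) and the small coverage objective are hypothetical and are not taken from the text.

```python
def gain_per_cost(F, selected, x, cost):
    """Marginal gain of adding x to `selected`, divided by the cost of x."""
    return (F(selected | {x}) - F(selected)) / cost[x]


def implied_alpha(F, selected, chosen, candidates, cost):
    """Largest alpha_j in [0, 1] satisfying the definition with beta_j = 0."""
    best = max(gain_per_cost(F, selected, x, cost) for x in candidates)
    if best <= 0:
        return 1.0  # nothing improves F, so any choice matches the greedy one
    achieved = gain_per_cost(F, selected, chosen, cost)
    return max(0.0, min(1.0, achieved / best))


# Hypothetical coverage objective: F(S) = number of points covered by S.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}}
cost = {"a": 2.0, "b": 1.0, "c": 1.0}


def coverage(S):
    return len(set().union(*(covers[x] for x in S)))


# Choosing "a" first gives gain/cost 1.5, while the greedy choice "b" gives 2.0.
alpha_a = implied_alpha(coverage, frozenset(), "a", covers, cost)
alpha_b = implied_alpha(coverage, frozenset(), "b", covers, cost)
```

Here the selection of "a" is approximately greedy with $\alpha_j = 0.75$, while selecting "b" recovers the exact greedy choice with $\alpha_j = 1$.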

Extending the previous approximation result for the greedy algorithm, we can get a similar
bound for approximately greedy algorithms applied to approximately submodular function optimization.
We first need to generalize Lemma 4.2.3 to also include the additive and multiplicative
error introduced by the approximately greedy algorithm.

Lemma 4.3.2. Let $s'_j$ be the value

\[ s'_j = \frac{F(\mathcal{G}'_{j-1} \cup \{g'_j\}) - F(\mathcal{G}'_{j-1})}{c(g'_j)} \]

for element $g'_j$ selected by an approximately greedy algorithm according to Definition 4.3.1.
Then for all sequences $S$ and total costs $C$:

\[ F(S_{\langle C \rangle}) \le F(\mathcal{G}'_{j-1}) + \frac{C s'_j}{\alpha_j \gamma} + \frac{\beta_j}{\alpha_j \gamma} + \frac{\delta}{\gamma}. \]

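Proof. By Lemma 4.2.2 we have:

\begin{align*}
\alpha_j \frac{\gamma \left[ F(\mathcal{G}'_{j-1} \cup S_{\langle C \rangle}) - F(\mathcal{G}'_{j-1}) \right] - \delta}{c(S_{\langle C \rangle})} - \frac{\beta_j}{C}
  &\le \alpha_j \max_{x \in S_{\langle C \rangle}} \frac{F(\mathcal{G}'_{j-1} \cup \{x\}) - F(\mathcal{G}'_{j-1})}{c(x)} - \frac{\beta_j}{C} \\
  &\le \max_{x \in S_{\langle C \rangle}} \frac{\alpha_j \left[ F(\mathcal{G}'_{j-1} \cup \{x\}) - F(\mathcal{G}'_{j-1}) \right] - \beta_j}{c(x)} \\
  &\le s'_j.
\end{align*}

By monotonicity we have $F(S_{\langle C \rangle}) \le F(\mathcal{G}'_{j-1} \cup S_{\langle C \rangle})$, and by definition $c(S_{\langle C \rangle}) \le C$, giving:

\[ F(S_{\langle C \rangle}) \le F(\mathcal{G}'_{j-1}) + \frac{C s'_j}{\alpha_j \gamma} + \frac{\beta_j}{\alpha_j \gamma} + \frac{\delta}{\gamma}. \]

■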


Now,  we  can  reproduce  the  same  argument  as  used  for  analyzing  the  greedy  algorithm,  but 
with  the  previous  lemma  as  a  starting  point. 

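Theorem 4.3.3. Let $\mathcal{G}' = (g'_1, \ldots)$ be the sequence selected by an $(\alpha, \beta)$-approximately greedy
algorithm as in Definition 4.3.1. Fix some $K > 0$. Let $B = \sum_{i=1}^{K} c(g'_i)$. Let $F$ be $(\gamma, \delta)$-approximately
submodular as in Definition 4.2.1. For any sequence $S$ and total cost $C$,

\[ F(\mathcal{G}'_{\langle B \rangle}) \ge \left(1 - e^{-\alpha \gamma \frac{B}{C}}\right) F(S_{\langle C \rangle}) - \beta \frac{B}{C} - \alpha \delta \frac{B}{C}. \]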

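Proof. Define $\Delta_j = F(S_{\langle C \rangle}) - F(\mathcal{G}'_j)$. By Lemma 4.3.2,

\[ F(S_{\langle C \rangle}) \le F(\mathcal{G}'_j) + \frac{C s'_{j+1}}{\alpha_{j+1} \gamma} + \frac{\beta_{j+1}}{\alpha_{j+1} \gamma} + \frac{\delta}{\gamma}. \]

By definition of $s'_{j+1}$:

\[ \Delta_j \le \frac{C s'_{j+1}}{\alpha_{j+1} \gamma} + \frac{\beta_{j+1}}{\alpha_{j+1} \gamma} + \frac{\delta}{\gamma}
   = \frac{C}{\alpha_{j+1} \gamma} \left( \frac{\Delta_j - \Delta_{j+1}}{c(g'_{j+1})} \right) + \frac{\beta_{j+1}}{\alpha_{j+1} \gamma} + \frac{\delta}{\gamma}. \]

Rearranging we get

\[ \Delta_{j+1} \le \Delta_j \left(1 - \frac{c(g'_{j+1}) \alpha_{j+1} \gamma}{C}\right) + \frac{\beta_{j+1} c(g'_{j+1})}{C} + \frac{\alpha_{j+1} \delta c(g'_{j+1})}{C}. \]

Unroll to get

\[ \Delta_K \le \Delta_0 \prod_{j=1}^{K} \left(1 - \frac{c(g'_j) \alpha_j \gamma}{C}\right)
   + \sum_{j=1}^{K} \left( \prod_{i=j+1}^{K} \left(1 - \frac{c(g'_i) \alpha_i \gamma}{C}\right) \right) \left( \frac{\beta_j c(g'_j)}{C} + \frac{\alpha_j \delta c(g'_j)}{C} \right). \]

Using $1 - \frac{c(g'_j) \alpha_j \gamma}{C} < 1$,

\[ \Delta_K \le \Delta_0 \prod_{j=1}^{K} \left(1 - \frac{c(g'_j) \alpha_j \gamma}{C}\right) + \sum_{j=1}^{K} \frac{\beta_j c(g'_j)}{C} + \sum_{j=1}^{K} \frac{\alpha_j \delta c(g'_j)}{C}. \]

Let $\alpha \le \alpha_j$. Given that $B = \sum_{j=1}^{K} c(g'_j)$, the product is maximized at $c(g'_j) = \frac{B}{K}$. Substituting in
and using the fact that $\left(1 - \frac{z}{K}\right)^K < e^{-z}$:

\begin{align*}
F(S_{\langle C \rangle}) - F(\mathcal{G}'_K) = \Delta_K
  &\le \Delta_0 \left(1 - \alpha \gamma \frac{B}{C} \frac{1}{K}\right)^K + \sum_{j=1}^{K} \frac{\beta_j c(g'_j)}{C} + \alpha \delta \frac{B}{C} \\
  &< F(S_{\langle C \rangle}) \, e^{-\alpha \gamma \frac{B}{C}} + \sum_{j=1}^{K} \frac{\beta_j c(g'_j)}{C} + \alpha \delta \frac{B}{C},
\end{align*}

or

\[ F(\mathcal{G}'_{\langle B \rangle}) \ge \left(1 - e^{-\alpha \gamma \frac{B}{C}}\right) F(S_{\langle C \rangle}) - \sum_{j=1}^{K} \frac{\beta_j c(g'_j)}{C} - \alpha \delta \frac{B}{C}. \]

Now, if $\beta \ge \beta_j$ for all $j$,

\[ F(\mathcal{G}'_{\langle B \rangle}) \ge \left(1 - e^{-\alpha \gamma \frac{B}{C}}\right) F(S_{\langle C \rangle}) - \beta \frac{B}{C} - \alpha \delta \frac{B}{C}. \]

■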


As expected from the greedy algorithm analysis, the multiplicative and additive terms $(\alpha, \beta)$ get
incorporated in the same manner as the respective $(\gamma, \delta)$ terms from the approximate submodularity
bound.

This  bound  incorporates  a  number  of  features  of  previous  results  as  well.  For  example,  the 
corresponding  bound  from  the  work  of  Das  and  Kempe  [2011]  for  the  Orthogonal  Matching  Pursuit 
algorithm in the unit-cost case has the same $\alpha \gamma$ term.

Other work on no-regret learning of approximators for the submodular function $F$ [Streeter
and Golovin, 2008, Ross et al., 2013] incorporates the additive term

\[ \sum_{j=1}^{K} \beta_j \frac{c(g_j)}{C}, \]

where $\beta_j$ is the error made by the no-regret learner at each iteration. This same term is seen in the
previous proof, and is only simplified using $\beta \ge \beta_j$. The same analysis above could be used to
extend similar no-regret analyses to the approximately submodular setting, in the same manner as
this previous work.


4.4  Bi-criteria  Approximation  Bounds  for  Arbitrary  Budgets 

In  the  previous  sections,  we  derived  bounds  that  hold  only  for  the  budgets  B  which  correspond 
to  the  budgets  at  which  the  algorithm  adds  a  new  element  to  the  sequence.  In  many  settings,  this 
guarantee  can  be  poor  for  the  list  selected  by  the  greedy  algorithm  in  practice.  For  example,  if  the 
algorithm  selects  a  single,  high-cost,  high-value  item  at  the  first  iteration,  then  the  smallest  budget 
the  guarantee  holds  for  is  the  cost  of  that  item. 

As discussed in Section 4.1, there are many approaches that can obtain guarantees for any arbitrary
budget $B$, but unfortunately these algorithms do not generate a single common sequence
for all budgets. Because we ultimately want anytime or anybudget behavior, we would like similar
guarantees for a budget-agnostic algorithm. Unfortunately, as we will show shortly, a guarantee
that has the same form as previous ones is not possible for arbitrary submodular functions. Instead,
in this section we will develop an algorithm and corresponding bound that gives a bi-criteria
approximation in both value and budget. Such a bound will have the form

\[ F(\mathcal{G}_{\langle B \rangle}) \ge (1 - c_1) \, F(S_{\langle B / c_2 \rangle}). \tag{4.4} \]

Here we have the standard $(1 - c_1)$-approximation in value when compared to any arbitrary sequence
$S$, but we also have a $c_2$-approximation in budget; that is, in order to be competitive with a
given sequence we need to incur a factor of $c_2$ additional cost.

We will now show the inherent difficulty in obtaining good performance from a budget-agnostic
algorithm which generates a single sequence, and demonstrate the necessity of the bi-criteria
approximation given above. Consider the following budgeted maximization problem:

\[ X = \{1, 2, \ldots\}, \qquad c(x) = x, \qquad F(S) = \sum_{x \in S} e^{x}. \tag{4.5} \]

We can use this problem to illustrate the inherent difficulty in generating single sequences that
are competitive at arbitrary budgets $B$. This problem is in fact a modular optimization, and furthermore,
has a very simple optimal solution of always selecting the single largest element that
fits within a given budget. As the next result shows, however, even achieving a bound derived for
submodular functions is difficult unless the cost approximation is fairly loose.

Theorem 4.4.1. Let $\mathcal{A}$ be any algorithm for selecting sequences $A = (a_1, \ldots)$. The best bi-criteria
approximation $\mathcal{A}$ can satisfy must be at least a 4-approximation in cost for the problem
described in Equation (4.5). That is, there does not exist a $C < 4$ such that, for all $B \ge c_{\min}$ and
all sequences $S$,

\[ F(A_{\langle B \rangle}) \ge \left(1 - \frac{1}{e}\right) F(S_{\langle B / C \rangle}). \]


Proof. First, by construction of the problem, it is clear that the optimal set for a given (integral)
budget $B'$ is to select the largest element $x \le B'$ for a value of $e^{B'}$. Furthermore, because

\[ \frac{\sum_{x=1}^{B'-1} e^{x}}{e^{B'}} < \frac{1}{e - 1}, \]

the only way to be a $(1 - \frac{1}{e})$-approximation in value for all sets $S_{\langle B' \rangle}$ is to have selected an
element $x \ge B'$.

Consider the sequence $A$ at some element $j$. Let the cost $c(A_j) = b$. Let the largest element
in $A_j$ be some function of $b$, $f(b)$, with value $e^{f(b)}$.

Now consider the next element $a_{j+1}$. To maintain the property that $A$ is a $C$-approximation
in cost, $c(A_{j+1})$ is at most $C f(b)$, which implies that $c(a_{j+1}) \le C f(b) - b$. In order for the
sequence to continue extending itself to arbitrarily large budgets, the ratio between the cost of the
sequence and the largest element in the sequence must be increasing, giving

\[ \frac{b}{f(b)} \le \frac{C f(b)}{C f(b) - b}. \]

Rearranging and using the fact that all terms are positive gives

\[ b^2 - C b f(b) + C f(b)^2 \ge 0. \]

The above inequality only holds when $\sqrt{C} \ge 2$, or $C \ge 4$, proving the theorem. ■
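The geometric-sum step in the proof can be checked numerically. The following sketch (the function name `prefix_ratio` is an illustrative choice) computes the total value of all elements strictly smaller than $B'$ relative to $e^{B'}$ and confirms that it stays below $\frac{1}{e-1}$, which is itself below $1 - \frac{1}{e}$.

```python
import math


def prefix_ratio(budget):
    """Value of all elements 1, ..., budget - 1 divided by the value e^budget."""
    smaller = sum(math.exp(x) for x in range(1, budget))
    return smaller / math.exp(budget)


bound = 1.0 / (math.e - 1.0)   # roughly 0.582
target = 1.0 - 1.0 / math.e    # roughly 0.632
ratios = {b: prefix_ratio(b) for b in (2, 5, 10, 20)}
```

Even selecting every element cheaper than $B'$ therefore falls short of a $(1 - \frac{1}{e})$ fraction of the optimal value $e^{B'}$, which is why the sequence must contain some element $x \ge B'$.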


As an aside, the argument in the proof above also shows that the optimal single sequence $A$,
i.e. the sequence for which the argument above is tightest, will output elements $a_j$ with cumulative
cost $c(A_j) = b$ such that $c(a_j) = f(b) = \frac{b}{\sqrt{C}}$. In the case of the tightest achievable bound when
$C = 4$, this corresponds to selecting an element at every iteration that roughly doubles the current
cost of the list. We will examine an algorithm (Algorithm 4.2) which demonstrates exactly this
doubling behavior.

Now that we have a bound on the best cost-approximation any single sequence algorithm
can obtain, we now present an algorithm which does satisfy the bi-criteria approximation in
Equation (4.4), with a cost approximation factor of 6.¹

Algorithm 4.2 presents a doubling strategy for selecting a sequence of elements by effectively
doubling the space of elements that can be added at each iteration. At the first iteration, the algorithm
selects from the elements with cost at most some minimum cost $c_{\min}$. For every following iteration, the
algorithm selects from all elements with cost at most the total cost of the items selected so far, at
most doubling the total cost of the sequence.


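Algorithm 4.2 Doubling Algorithm

Given: objective function $F$, elements $X$, minimum cost $c_{\min}$
Let $g_1 = \arg\max_{x \in X,\; c(x) \le c_{\min}} \frac{F(\{x\})}{c(x)}$.
Let $\mathcal{G}_1 = \{g_1\}$.
for $j = 2, \ldots$ do
    Let $g_j = \arg\max_{x \in X \setminus \mathcal{G}_{j-1},\; c(x) \le c(\mathcal{G}_{j-1})} \frac{F(\mathcal{G}_{j-1} \cup \{x\}) - F(\mathcal{G}_{j-1})}{c(x)}$.
    Let $\mathcal{G}_j = \mathcal{G}_{j-1} \cup \{g_j\}$.
end for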
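The doubling strategy can be sketched in a few lines of Python. This is a minimal illustration under assumptions not made in the text: the element set is finite, costs are given as a dictionary, ties are broken by sorted order, and the loop simply stops if no eligible element remains. The names `doubling_sequence`, `values`, `costs` and `modular_value` are hypothetical.

```python
def doubling_sequence(F, cost, c_min):
    """Select a sequence where each new element costs at most the total so far."""
    def ratio(selected, x):
        return (F(selected | {x}) - F(selected)) / cost[x]

    remaining = set(cost)
    selected = frozenset()
    sequence, total = [], 0.0
    limit = c_min  # first iteration: only elements with cost at most c_min
    while True:
        pool = sorted(x for x in remaining if cost[x] <= limit)
        if not pool:
            break  # nothing left, or the instance is not doubling capable
        g = max(pool, key=lambda x: ratio(selected, x))
        sequence.append(g)
        selected = selected | {g}
        remaining.discard(g)
        total += cost[g]
        limit = total  # later iterations: cost at most the total spent so far
    return sequence


# Hypothetical modular objective with a few cheap elements to start from.
values = {"a": 1, "b": 1, "c": 3, "d": 10, "e": 30, "f": 2}
costs = {"a": 1, "b": 1, "c": 2, "d": 4, "e": 8, "f": 3}


def modular_value(S):
    return sum(values[x] for x in S)


order = doubling_sequence(modular_value, costs, c_min=1)
```

On this toy instance the selected costs are 1, 1, 2, 4, 8 and finally 3, so every element after the first costs no more than everything selected before it, and the total cost at most doubles at each step.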


This algorithm allows for a bi-criteria approximation for arbitrary approximately submodular
maximization problems, as long as the doubling algorithm doesn’t get stuck at any iteration.
The following definition just outlines the conditions that allow the doubling algorithm to succeed,
namely that the algorithm can always continue to select new elements at every iteration.

Definition 4.4.2. Let $\mathcal{G} = (g_1, \ldots)$ be the sequence selected by the doubling algorithm. The set
$X$ and function $F$ are doubling capable if at every iteration $j$, the set

\[ \{ x \mid x \in X \setminus \mathcal{G}_{j-1},\; c(x) \le c(\mathcal{G}_{j-1}) \} \]

is non-empty.

We  will  assume  that  this  definition  holds  for  the  rest  of  the  analysis.  In  order  to  prove  the 
bi-criteria  approximation,  we  first  need  the  following  lemma,  describing  the  behavior  of  the  total 
cost  of  the  subsets  selected  by  the  doubling  algorithm. 


¹ We conjecture that the cost approximation factor for Algorithm 4.2 is actually 4, but are not able to prove it
directly using the analysis here.

Lemma 4.4.3. Let $\mathcal{G} = (g_1, \ldots)$ be the sequence selected by the doubling algorithm. Fix some
$B$ such that $c_{\min} \le B \le c(X)$. There exists some $K$ such that $\frac{B}{2} \le \sum_{i=1}^{K} c(g_i) \le B$.

Proof. Consider the largest $K$ such that $\sum_{i=1}^{K} c(g_i) \le B$. Examine the next element, $g_{K+1}$.
Both $g_K$ and $g_{K+1}$ must exist because $X$ is doubling capable according to Definition 4.4.2. By
construction,

\[ \sum_{i=1}^{K+1} c(g_i) > B. \]

Because $c(g_{K+1}) \le c(\mathcal{G}_K)$ we know that

\[ 2 \sum_{i=1}^{K} c(g_i) \ge \sum_{i=1}^{K+1} c(g_i) > B, \]

completing the proof. ■


Using  the  above  lemma,  we  can  now  give  the  bi-criteria  approximation  bound  for  Algorithm  4.2 
for  any  given  budget.  The  bound  gives  a  6  approximation  in  cost  and  the  same  approximation  in 
value  as  the  previous  greedy  result  in  Theorem  4.2.4. 


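Theorem 4.4.4. Let $\mathcal{G} = (g_1, \ldots)$ be the sequence selected by the doubling algorithm (Algorithm 4.2).
Fix some $B \ge c_{\min}$. Let $F$ be $(\gamma, \delta)$-approximately submodular as in Definition 4.2.1.
For any sequence $S$,

\[ F(\mathcal{G}_{\langle B \rangle}) \ge \left(1 - e^{-\gamma}\right) F(S_{\langle B/6 \rangle}) - \delta. \]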


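Proof. Clearly, if $B \ge c(X)$, the theorem trivially holds for $\mathcal{G}_{\langle B \rangle} = X$.

If $B < c(X)$, using Lemma 4.4.3, we know that there must be some $K$ such that
$\frac{B}{2} \le \sum_{i=1}^{K} c(g_i) \le B$. Similarly there must exist some $k$ such that $\frac{B}{6} \le \sum_{i=1}^{k} c(g_i) \le \frac{B}{3}$.

Consider the sequence $\mathcal{G}' = (g_{k+1}, \ldots, g_K)$. Let $\mathcal{G}'_j = \mathcal{G}_k \cup \{g_{k+1}, \ldots, g_{k+j}\}$. We can derive a
modified version of the bound in Lemma 4.2.3 that gives:

\[ F(S_{\langle B/6 \rangle}) \le F(\mathcal{G}'_{j-1}) + \frac{\frac{B}{6} s'_j + \delta}{\gamma}, \]

where $s'_j$ is the maximum gain at iteration $j + k$ of Algorithm 4.2. This holds because $c(\mathcal{G}_k) \ge \frac{B}{6}$,
implying that the maximum at iteration $k + j$ used to calculate $s'_j$ is over a superset of the
elements in $S_{\langle B/6 \rangle}$.

Now, using that modified version of Lemma 4.2.3, we can re-apply the same argument from
Theorem 4.2.4 and show that

\[ F(S_{\langle B/6 \rangle}) - \frac{\delta}{\gamma} - F(\mathcal{G}_K) \le \left( F(S_{\langle B/6 \rangle}) - \frac{\delta}{\gamma} \right) \prod_{j=k+1}^{K} \left(1 - \frac{6 c(g_j) \gamma}{B}\right). \]

By construction of $k$ and $K$, $\sum_{j=k+1}^{K} c(g_j) \ge \frac{B}{6}$, so we can simplify this to

\[ F(S_{\langle B/6 \rangle}) - \frac{\delta}{\gamma} - F(\mathcal{G}_K) \le \left( F(S_{\langle B/6 \rangle}) - \frac{\delta}{\gamma} \right) \left(1 - \frac{\gamma}{K - k}\right)^{K - k}, \]

after which bounding the $\left(1 - \frac{\gamma}{K - k}\right)^{K - k}$ term and rearranging gives the final bound:

\[ F(\mathcal{G}_{\langle B \rangle}) \ge \left(1 - e^{-\gamma}\right) F(S_{\langle B/6 \rangle}) - \delta. \]

■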


The same basic arguments as above can also be used to show that an approximately greedy
algorithm, when modified with the selection strategy of the doubling algorithm, will also give a
bi-criteria approximation for any budget. The cost approximation will still be a factor of 6, and the
value approximation will be the same as in Theorem 4.3.3.




Chapter  5 

Sparse  Approximation 


In this chapter we will examine the sparse approximation or subset selection problem. We will
show that the greedy and approximately greedy algorithm analysis of the previous chapter can be
applied here, and will derive corresponding approximation results for this setting. In contrast to
previous work on similar analyses, where the approximation bounds depend primarily on factors
related to the geometry, or orthogonality, of the features, we derive bounds that depend primarily
on factors which place small weights on the optimal subset of features.
We also present novel, time-based versions of classic feature or subset selection algorithms, and
show that for the budgeted feature selection problem, these approaches significantly outperform
approaches which do not consider feature cost.


5.1  Background 


Given a set of variables or features $X_i \in X$ and a target variable $Y$, the sparse approximation
problem is to select a subset of the variables $\mathcal{D} \subset X$ that minimizes the reconstruction error

\[ \min_{w} E\left[ \tfrac{1}{2} (Y - w^T X_{\mathcal{D}})^2 \right], \]

where $X_{\mathcal{D}} = [X_i \mid X_i \in \mathcal{D}]$. Typically the selection is done with respect to some constraint on the
selected $\mathcal{D}$, such as a cardinality constraint $|\mathcal{D}| \le B$.

This problem is commonly framed in the literature as a constrained loss minimization problem
of the loss function

\[ f(w) = E\left[ (Y - w^T X)^2 \right], \tag{5.1} \]

where the constraint is designed to induce sparsity on the weight vector $w$.

The sparse approximation problem can then be written as a loss minimization with respect to a
constraint on the number of non-zero entries in $w$:

\[ \min_{w} f(w) \quad \text{subject to} \quad \|w\|_0 \le B, \tag{5.2} \]

where $\|\cdot\|_0$ is the “0-norm”, or number of non-zero elements in $w$. The selected elements in
$\mathcal{D}$ then simply correspond to the non-zero indices selected in the optimal solution to the above
constrained problem.

Algorithm 5.1 Forward Regression

Given: elements $X$, target $Y$
Define $F$ as in Equation (5.3)
Let $\mathcal{D}_0 = \emptyset$.
for $j = 1, \ldots$ do
    Let $x^* = \arg\max_{x \in X} \frac{F(\mathcal{D}_{j-1} \cup \{x\}) - F(\mathcal{D}_{j-1})}{c(x)}$.
    Let $\mathcal{D}_j = \mathcal{D}_{j-1} \cup \{x^*\}$.
end for

Another way to re-write the sparse approximation objective is as a set function $F(S)$:

\begin{align*}
F(S) &= \tfrac{1}{2} E[Y^2] - \min_{w \in \mathbb{R}^{|S|}} \tfrac{1}{2} E\left[ (Y - w^T X_S)^2 \right] \tag{5.3} \\
     &= \max_{w \in \mathbb{R}^{|S|}} b_S^T w - \tfrac{1}{2} w^T C_S w, \tag{5.4}
\end{align*}

where $b$ is a vector of covariances such that $b_i = \mathrm{Cov}(X_i, Y)$ and $C$ is the covariance matrix of
the variables $X_i$, with $C_{ij} = \mathrm{Cov}(X_i, X_j)$. Furthermore, $C_S$ is the subset of rows and columns of
$C$ corresponding to the variables selected in $S$ and $b_S$ is the equivalent over vector indices of the
vector $b$.

The sparse approximation problem is then the same as the monotone maximization problem
over the set function $F$ given in Equation (4.2), as studied in the previous chapter. Extending
this problem to the budgeted setting from the previous chapter as well, each feature also has an
associated cost $c(X_i)$ and the goal is to select a subset $\mathcal{D}$ such that $c(\mathcal{D}) = \sum_{x \in \mathcal{D}} c(x) \le B$. For
the coordinate-based version of the problem in Equation (5.2), this can be replaced with a weighted
equivalent of the zero-norm.

For  our  analysis,  we  will  stick  to  the  set  function  maximization  setting  analyzed  in  Chapter  4, 
but  these  representations  are  all  equivalent. 


5.1.1  Algorithms 

Two approaches to solving this problem which we will analyze here are cost-greedy versions
of the existing Forward Regression [Miller, 2002] and Orthogonal Matching Pursuit [Pati
et al., 1993] algorithms. The cost-aware Forward Regression algorithm, given in Algorithm 5.1,
simply selects the next variable $x$ which maximizes the gain in objective $F$, divided by the cost
$c(x)$. This is equivalent to the standard cost-greedy algorithm (Algorithm 4.1) for maximizing the
set function $F$ given in Equation (5.3).

Algorithm 5.2 Orthogonal Matching Pursuit

Given: elements $X$, target $Y$
Define $F$ as in Equation (5.3)
Let $\mathcal{D}_0 = \emptyset$.
for $j = 1, \ldots$ do
    Let $w^* = \arg\min_{w} E\left[ (Y - w^T X_{\mathcal{D}_{j-1}})^2 \right]$.
    Let $x^* = \arg\max_{x \in X} \frac{E\left[ (Y - w^{*T} X_{\mathcal{D}_{j-1}}) \, x \right]^2}{c(x)}$.
    Let $\mathcal{D}_j = \mathcal{D}_{j-1} \cup \{x^*\}$.
end for
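As a rough illustration of the cost-aware Forward Regression rule (Algorithm 5.1), the sketch below works directly from the covariance vector $b$ and covariance matrix $C$. It uses the closed form $F(S) = \frac{1}{2} b_S^T C_S^{-1} b_S$, obtained by maximizing the quadratic in Equation (5.4), and assumes $C_S$ is invertible. Stopping once no remaining feature fits the budget is an assumption of this sketch, and the function names are hypothetical.

```python
import numpy as np


def objective(b, C, S):
    """F(S) = 1/2 b_S^T C_S^{-1} b_S, the maximum of the quadratic in (5.4)."""
    S = list(S)
    if not S:
        return 0.0
    w = np.linalg.solve(C[np.ix_(S, S)], b[S])  # optimal weights for subset S
    return 0.5 * float(b[S] @ w)


def forward_regression(b, C, cost, budget):
    """Greedily add the affordable feature with the best gain per unit cost."""
    selected, spent = [], 0.0
    remaining = set(range(len(b)))
    while True:
        affordable = [i for i in sorted(remaining) if spent + cost[i] <= budget]
        if not affordable:
            break
        base = objective(b, C, selected)
        best = max(
            affordable,
            key=lambda i: (objective(b, C, selected + [i]) - base) / cost[i],
        )
        selected.append(best)
        spent += cost[best]
        remaining.discard(best)
    return selected


# Toy instance: three orthogonal features, the most useful one is expensive.
b_demo = np.array([0.5, 0.4, 0.3])
C_demo = np.eye(3)
cost_demo = [4.0, 1.0, 1.0]
small_budget = forward_regression(b_demo, C_demo, cost_demo, budget=2.0)
large_budget = forward_regression(b_demo, C_demo, cost_demo, budget=6.0)
```

With a budget of 2 only the two cheap features are selected; with a budget of 6 the expensive feature is added last, because its gain per unit cost is the lowest.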

The Orthogonal Matching Pursuit (OMP) algorithm [Pati et al., 1993], modified to handle
feature costs and given in Algorithm 5.2, is a more specialized algorithm for optimizing set functions
$F$ that correspond to an underlying loss optimization. The classic OMP algorithm for optimizing
a loss function first computes a gradient of the loss function given the currently selected variables,
then selects the next variable which maximizes the inner product with the computed gradient. In the
sparse approximation setting, this corresponds to computing the current residual $Z = Y - w^{*T} X_{\mathcal{D}}$
given the currently selected elements $\mathcal{D}$, and then selecting the $X_i$ which maximizes $(\mathrm{Cov}(X_i, Z))^2$,
or $E\left[ (Y - w^{*T} X_{\mathcal{D}_{j-1}})^T X_i \right]^2$. To make this algorithm cost-aware, we simply augment the greedy
maximization of the gradient term to be discounted by the feature cost.

For the coordinate-based version of the sparse approximation problem (Equation (5.2)), this can
also be viewed as performing a coordinate descent over the weight vector $w$, as the maximization

\[ x^* = \arg\max_{x \in X} \frac{E\left[ (Y - w^T X_{\mathcal{D}_{j-1}}) \, x \right]^2}{c(x)} \]

is equivalent to selecting the dimension of $w$ with the corresponding steepest gradient.
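The cost-aware OMP selection rule can likewise be sketched from the covariances alone. For the current set $\mathcal{D}$ with optimal weights $w^*$, the covariance of each $X_i$ with the residual $Z$ is $b_i - C_{i \mathcal{D}} w^*$. The sketch below assumes $C_{\mathcal{D}}$ is invertible and runs for a fixed number of steps; the name `cost_aware_omp` and its arguments are hypothetical.

```python
import numpy as np


def cost_aware_omp(b, C, cost, num_steps):
    """Pick features by squared residual covariance divided by feature cost."""
    cost = np.asarray(cost, dtype=float)
    selected = []
    for _ in range(min(num_steps, len(b))):
        if selected:
            # Refit weights on the selected features, then form Cov(X_i, Z).
            w = np.linalg.solve(C[np.ix_(selected, selected)], b[selected])
            residual_cov = b - C[:, selected] @ w
        else:
            residual_cov = b.astype(float)
        scores = residual_cov ** 2 / cost
        if selected:
            scores[selected] = -np.inf  # never re-select a feature
        selected.append(int(np.argmax(scores)))
    return selected


# Features 0 and 1 are highly correlated; feature 2 is independent of both.
C_demo = np.array([[1.0, 0.9, 0.0],
                   [0.9, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
b_demo = np.array([0.8, 0.75, 0.3])
unit_cost_order = cost_aware_omp(b_demo, C_demo, [1.0, 1.0, 1.0], 3)
costly_first = cost_aware_omp(b_demo, C_demo, [10.0, 1.0, 1.0], 1)
```

After feature 0 is selected, the residual covariance of its near-duplicate (feature 1) drops to 0.03, so the independent feature 2 is chosen next. Making feature 0 ten times as expensive causes feature 1 to be chosen first instead.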

In previous work [Das and Kempe, 2011], these algorithms, applied to the sparse approximation
problem, have been analyzed in the context of the submodular optimization setting. This
previous work has shown that the sparse approximation problem is in fact approximately submodular
(Definition 4.2.1), and that the submodular optimization analysis shown in the last chapter can
be directly applied to these algorithms and the sparse approximation problem.

Specifically, they show that the sparse approximation problem is $(\lambda_{\min}(C), 0)$-approximately
submodular, where $\lambda_{\min}(C)$ is the minimum eigenvalue of the covariance matrix $C$, which captures
the degree to which the variables $X_i$ are non-orthogonal. Additionally, they show that the OMP
algorithm is also $(\lambda_{\min}(C), 0)$-approximately greedy in this setting.

Let $\mathcal{D}_{\langle B \rangle}$ be the set selected by forward regression for some budget $B$, and $S_{\langle C \rangle}$ be the optimal
set of features for some other budget $C$. Using the results in Theorem 4.2.4 this previous work
gives a bound of

\[ F(\mathcal{D}_{\langle B \rangle}) \ge \left(1 - e^{-\lambda_{\min}(C) \frac{B}{C}}\right) F(S_{\langle C \rangle}) \]

on the performance of the forward algorithm compared to optimal performance. Similarly for the
OMP algorithm we find an approximation factor of

\[ \left(1 - e^{-\lambda_{\min}(C)^2 \frac{B}{C}}\right). \]


Figure 5.1: (a) Non-submodularity in features, that is, when the sum is better than the parts, occurs when
two correlated features can be combined to reach a target that each poorly represents on its own. This
occurs often in practice; however, it takes very large weights for the combination of features to be better
than the features taken alone, which is disallowed by regularization. (b) Illustration of the approximation
guarantee for a simple problem with two highly correlated features, as a function of the correlation, or angle,
between the two features. We illustrate the bound for the completely spectral case [Das and Kempe, 2011],
and for the same problem with regularization of $\lambda = 0.5$ using the bound presented.

In  many  settings,  however,  these  geometric  factors  can  approach  their  worst  case  bounds.  Just 
two  highly  correlated  features  can  cause  the  minimum  eigenvalue  of  C  to  be  extremely  small.  For 
example, in future chapters, we will be applying this same analysis to sets of “features” $X_i$ that are
the  outputs  of  a  set  of  weak  predictors,  for  example  the  outputs  of  all  decision  trees  defined  over  a 
set  of  training  points.  In  this  setting  the  bounds  are  extremely  weak,  as  minor  changes  to  a  given 
weak  predictor  will  produce  another  highly  correlated  “feature”  in  the  set  X . 

Intuitively,  the  geometric  factors  that  previous  bounds  have  relied  on  are  needed  for  analysis 
because  as  the  variables  involved  diverge  from  orthogonality  and  become  more  dependent,  the 
performance  of  multiple  vectors  combined  together  can  vastly  outperform  the  performance  of  each 
vector  in  isolation.  This  gap  between  the  combined  gain  and  individual  gain  for  a  set  of  elements 
makes  greedy  selection  perform  arbitrarily  poorly  compared  to  combinatorial  enumeration,  as  the 
bounds in Chapter 4 and previous results [Krause and Cevher, 2010, Das and Kempe, 2011] show.

The inherent difficulty in the subset selection problem when the variables are highly dependent
is due to a simple fact that can be illustrated geometrically: when two vectors are nearly parallel,
combining them together with large weights can produce new vectors that are nearly orthogonal
to either of the two original vectors. This in turn can cause two variables to have small individual
gains, while still having large combined gains and in turn weakening the approximate submodularity
guarantees of the problem.
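This geometric effect is easy to reproduce numerically. The sketch below constructs a hypothetical pair of unit-variance features with correlation $\rho = 0.99$ and a unit-variance target $Y \propto X_1 - X_2$, then compares the individual gains with the combined gain, using the closed form $F(S) = \frac{1}{2} b_S^T C_S^{-1} b_S$ of the unregularized objective. The specific numbers are illustrative choices, not values from the text.

```python
import numpy as np

rho = 0.99
C = np.array([[1.0, rho],
              [rho, 1.0]])

# Y = scale * (X_1 - X_2) has unit variance when scale = 1 / sqrt(2 (1 - rho)).
scale = 1.0 / np.sqrt(2.0 * (1.0 - rho))
true_weights = np.array([scale, -scale])  # large weights, roughly 7.07 each
b = C @ true_weights                      # b_i = Cov(X_i, Y)


def objective(S):
    """Unregularized F(S) = 1/2 b_S^T C_S^{-1} b_S."""
    S = list(S)
    return 0.5 * float(b[S] @ np.linalg.solve(C[np.ix_(S, S)], b[S]))


single_gains = [objective([0]), objective([1])]  # each about 0.0025
joint_gain = objective([0, 1])                   # 0.5, since Y is fully explained
gain_ratio = sum(single_gains) / joint_gain      # about 0.01
```

In this instance the sum of the individual gains is only about 1% of the combined gain, and the ratio coincides with $\lambda_{\min}(C) = 1 - \rho$, the factor appearing in the spectral bounds discussed above.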

These problems with non-orthogonality only arise in the presence of large weighted combinations
of the underlying variables. To analyze the impact that the magnitude of the weights has on
the resulting approximation bounds, we will now analyze two regularized variants of the sparse
approximation problem. Using regularization we can reduce the impact that large weight vectors
can have on the gain of any given subset, thereby improving the approximate submodularity-based
bounds.

Furthermore, it is typically beneficial in practice to use some amount of regularization, to
avoid overfitting and increase the robustness of a selected set. In the absence of exponentially
large amounts of training data, a small amount of regularization would be warranted anyway, so
any improvement in the theoretical guarantees of the algorithm is just another added benefit.


5.2  Regularized  Sparse  Approximation 

The first approach to regularization we will analyze is a Tikhonov regularized version of the problem
which will directly penalize the gain of large weight vectors. This regularized version of the
sparse approximation problem is given as

\begin{align*}
F(S) &= \tfrac{1}{2} E[Y^2] - \min_{w \in \mathbb{R}^{|S|}} \tfrac{1}{2} E\left[ (Y - w^T X_S)^2 + \lambda w^T w \right] \tag{5.5} \\
     &= \max_{w \in \mathbb{R}^{|S|}} b_S^T w - \tfrac{1}{2} w^T (C_S + \lambda I) w, \tag{5.6}
\end{align*}

where  b  and  C  are  the  covariance  vector  and  matrix  as  defined  previously. 
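To see the effect of the regularizer concretely, the following sketch evaluates the regularized objective through the closed form $F(S) = \frac{1}{2} b_S^T (C_S + \lambda I)^{-1} b_S$, which follows from maximizing the quadratic in Equation (5.6). The two-feature instance (correlation 0.99, target proportional to $X_1 - X_2$) and the function name `regularized_objective` are hypothetical.

```python
import numpy as np


def regularized_objective(b, C, S, lam):
    """F(S) = 1/2 b_S^T (C_S + lam I)^{-1} b_S for the Tikhonov regularized problem."""
    S = list(S)
    if not S:
        return 0.0
    M = C[np.ix_(S, S)] + lam * np.eye(len(S))
    return 0.5 * float(b[S] @ np.linalg.solve(M, b[S]))


rho = 0.99
C = np.array([[1.0, rho],
              [rho, 1.0]])
scale = 1.0 / np.sqrt(2.0 * (1.0 - rho))
b = C @ np.array([scale, -scale])  # covariances with Y = scale * (X_1 - X_2)


def single_to_joint_ratio(lam):
    """Sum of individual gains divided by the combined gain of both features."""
    singles = sum(regularized_objective(b, C, [i], lam) for i in (0, 1))
    return singles / regularized_objective(b, C, [0, 1], lam)


ratio_unregularized = single_to_joint_ratio(0.0)  # about 0.01
ratio_regularized = single_to_joint_ratio(0.5)    # about 0.34
```

With $\lambda = 0.5$ the individual gains recover roughly a third of the combined gain, compared with about 1% without regularization, since the large weights needed to combine the two features are now penalized.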

Just  as  in  previous  work  [Das  and  Kempe,  2011]  for  the  unregularized  case,  we  can  show  that 
this  regularized  version  of  the  sparse  approximation  problem  is  approximately  submodular  as  given 
in  Definition  4.2.1. 

To  do  this,  we  will  first  need  to  consider  a  few  lemmas  which  allow  us  to  relate  our  result  to 
the  spectral  properties  used  in  previous  work  [Das  and  Kempe,  2011]. 

Let $C_S^A$ be the covariance matrix of the residual components of the set $S$ with respect to the
set $A$. Specifically, if we define $\mathrm{Res}(X_i, A)$ to be the portion of $X_i$ orthogonal to the variables
selected in $A$, then $C_S^A$ is the covariance matrix of the variables $X'_i = \mathrm{Res}(X_i, A)$ for $i \in S$.
The first lemma relates the eigenvalues of the covariance matrix of residuals, $C_S^A$, to the equivalent
matrix that will appear in the analysis of the regularized problem.

Lemma 5.2.1. Given sets of variables $S$ and $A$, let $C_S^A$ be the covariance matrix for the residual
components of $S$ with respect to $A$, and $C_S^{A'}$ such that

\begin{align*}
C_S^A &= C_S - C_{SA} C_A^{-1} C_{AS} \\
C_S^{A'} &= C_S - C_{SA} (C_A + \lambda I)^{-1} C_{AS},
\end{align*}

for some $\lambda \ge 0$. Then

\[ \lambda_{\min}(C_S^{A'}) \ge \lambda_{\min}(C_S^A), \]

where $\lambda_{\min}(C)$ is the minimum eigenvalue of $C$.

Proof. For all vectors $x$, we have that

\[ x^T C_{SA} (C_A + \lambda I)^{-1} C_{AS} x \le x^T C_{SA} C_A^{-1} C_{AS} x, \]

which implies that

\[ x^T C_S^{A'} x \ge x^T C_S^A x \]

for all $x$.

Given that $\lambda_{\min}(C) = \min_{x^T x = 1} x^T C x$, we have that

\[ \lambda_{\min}(C_S^{A'}) \ge \lambda_{\min}(C_S^A), \]

completing the proof. ■


The next lemma is taken directly from previous work, and bounds the smallest eigenvalue of
$C_S^A$ in terms of that of the whole covariance matrix $C$. For more details on the proof of this lemma,
we refer the reader to the previous work.

Lemma 5.2.2 (Lemmas 2.5 and 2.6 from [Das and Kempe, 2011]). Given sets of variables $S$
and $A$, let $C_S^A$ be the covariance matrix for the residual components of $S$ with respect to $A$, i.e.

\[ C_S^A = C_S - C_{SA} C_A^{-1} C_{AS}. \]

Then

\[ \lambda_{\min}(C_S^A) \ge \lambda_{\min}(C), \]

where $\lambda_{\min}(C)$ is the minimum eigenvalue of $C$.


Finally, the last spectral lemma we need is simply a bound on the relationship between the
quadratic form $b^T Q^{-1} b$ and the norm $b^T b$.

Lemma 5.2.3. Let $b$ be an arbitrary vector and $Q$ a positive definite matrix. Then

\[ b^T Q^{-1} b \le \frac{b^T b}{\lambda_{\min}(Q)}. \]

Proof. Adapting the argument from Das and Kempe [2011], we have:

\[ \frac{b^T Q^{-1} b}{b^T b} \le \max_{v} \frac{v^T Q^{-1} v}{v^T v} = \lambda_{\max}(Q^{-1}) = \frac{1}{\lambda_{\min}(Q)}. \]

■


Now, to analyze the actual regularized sparse approximation problem, we can first derive an
expression for the gain $F(A \cup S) - F(A)$, to be used in proving the later theorems.

Lemma 5.2.4. For some variables $X_i$ and $Y$, let $F$ be as given in Equation (5.5) with regularization
parameter $\lambda$. Then

\[ F(A \cup S) - F(A) = \tfrac{1}{2} \, b_S^{A'T} (C_S^{A'} + \lambda I)^{-1} b_S^{A'}, \]

where

\begin{align*}
b_S^{A'} &= b_S - C_{SA} (C_A + \lambda I)^{-1} b_A \\
C_S^{A'} &= C_S - C_{SA} (C_A + \lambda I)^{-1} C_{AS}.
\end{align*}


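Proof. Starting with the definition of $F$ from Equation (5.5), we have

\[ F(A \cup S) - F(A) = \max_{w} \left[ b_{A \cup S}^T w - \tfrac{1}{2} w^T (C_{A \cup S} + \lambda I) w \right] - \max_{v} \left[ b_A^T v - \tfrac{1}{2} v^T (C_A + \lambda I) v \right]. \]

If we break the matrix up into blocks:

\[ C_{A \cup S} = \begin{bmatrix} C_A & C_{AS} \\ C_{SA} & C_S \end{bmatrix}, \]

and similarly break up $b_{A \cup S}$, we can rewrite as:

\begin{align*}
F(A \cup S) - F(A)
 &= \max_{w_S, w_A} \left[ b_S^T w_S + b_A^T w_A - \tfrac{1}{2} w_S^T (C_S + \lambda I) w_S - w_A^T C_{AS} w_S - \tfrac{1}{2} w_A^T (C_A + \lambda I) w_A \right] \\
 &\qquad - \max_{v} \left[ b_A^T v - \tfrac{1}{2} v^T (C_A + \lambda I) v \right] \\
 &= \max_{w_S} \left[ b_S^T w_S - \tfrac{1}{2} w_S^T (C_S + \lambda I) w_S + \max_{w_A} \left[ (b_A - C_{AS} w_S)^T w_A - \tfrac{1}{2} w_A^T (C_A + \lambda I) w_A \right] \right] \\
 &\qquad - \max_{v} \left[ b_A^T v - \tfrac{1}{2} v^T (C_A + \lambda I) v \right].
\end{align*}

Solving for the optimizations over $A$ directly gives

\begin{align*}
 &= \max_{w_S} \left[ b_S^T w_S - \tfrac{1}{2} w_S^T (C_S + \lambda I) w_S + \tfrac{1}{2} (b_A - C_{AS} w_S)^T (C_A + \lambda I)^{-1} (b_A - C_{AS} w_S) \right] \\
 &\qquad - \tfrac{1}{2} b_A^T (C_A + \lambda I)^{-1} b_A \\
 &= \max_{w_S} \Big[ b_S^T w_S - \tfrac{1}{2} w_S^T (C_S + \lambda I) w_S + \tfrac{1}{2} w_S^T C_{SA} (C_A + \lambda I)^{-1} C_{AS} w_S \\
 &\qquad - \left( C_{SA} (C_A + \lambda I)^{-1} b_A \right)^T w_S + \tfrac{1}{2} b_A^T (C_A + \lambda I)^{-1} b_A - \tfrac{1}{2} b_A^T (C_A + \lambda I)^{-1} b_A \Big] \\
 &= \max_{w_S} \left[ b_S^{A'T} w_S - \tfrac{1}{2} w_S^T (C_S^{A'} + \lambda I) w_S \right],
\end{align*}

where

\begin{align*}
b_S^{A'} &= b_S - C_{SA} (C_A + \lambda I)^{-1} b_A \\
C_S^{A'} &= C_S - C_{SA} (C_A + \lambda I)^{-1} C_{AS}.
\end{align*}

Solving this completes the proof:

\[ \max_{w_S} \left[ b_S^{A'T} w_S - \tfrac{1}{2} w_S^T (C_S^{A'} + \lambda I) w_S \right] = \tfrac{1}{2} \, b_S^{A'T} (C_S^{A'} + \lambda I)^{-1} b_S^{A'}. \]

■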


As an aside, tying back to the unregularized case [Das and Kempe, 2011], the corresponding result is equivalent to setting $\lambda = 0$, where the $b$ and $C$ matrices are simply the residual covariance vector and matrix for the set $S$, with respect to the set $A$:

$$b_S^{A} = b_S - C_{SA} C_A^{-1} b_A$$
$$C_S^{A} = C_S - C_{SA} C_A^{-1} C_{AS},$$

so the regularized form subsumes the previous result as one would expect.
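As a numerical sanity check of Lemma 5.2.4, the following minimal sketch compares the gain computed directly from the closed-form objective $F(S) = \frac{1}{2} b_S^T (C_S + \lambda I)^{-1} b_S$ with the residual form given by the lemma. The synthetic data, the helper names `F` and `gain_closed_form`, and the use of NumPy are assumptions made purely for illustration.

```python
import numpy as np

def F(b, C, idx, lam):
    """Regularized objective max_w b_S^T w - 0.5 w^T (C_S + lam I) w, solved in closed form."""
    idx = list(idx)
    if not idx:
        return 0.0
    Q = C[np.ix_(idx, idx)] + lam * np.eye(len(idx))
    return 0.5 * b[idx] @ np.linalg.solve(Q, b[idx])

def gain_closed_form(b, C, A, S, lam):
    """Gain F(A u S) - F(A) via the residual quantities b_S^{A'} and C_S^{A'}."""
    A, S = list(A), list(S)
    Q_A = C[np.ix_(A, A)] + lam * np.eye(len(A))
    C_SA = C[np.ix_(S, A)]
    b_res = b[S] - C_SA @ np.linalg.solve(Q_A, b[A])
    C_res = C[np.ix_(S, S)] - C_SA @ np.linalg.solve(Q_A, C_SA.T)
    return 0.5 * b_res @ np.linalg.solve(C_res + lam * np.eye(len(S)), b_res)

# Synthetic zero mean, unit variance variables (illustrative data only).
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 6))
X = (X - X.mean(0)) / X.std(0)
y = X @ rng.standard_normal(6) + 0.1 * rng.standard_normal(500)
C = X.T @ X / len(X)   # covariance matrix of the variables
b = X.T @ y / len(X)   # covariance vector with the target

A, S, lam = [0, 1], [2, 3, 4], 0.3
direct = F(b, C, A + S, lam) - F(b, C, A, lam)
closed = gain_closed_form(b, C, A, S, lam)
```

Under these assumptions `direct` and `closed` should agree up to floating point error, since the lemma is an exact identity.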


We  can  now  derive  the  approximate  submodularity  bound  for  the  Tikhonov  regularized  version 
of  the  problem,  using  the  above  lemmas. 


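Theorem 5.2.5. Let variables $X_i$ be zero mean, unit variance random variables and $Y$ be a target variable. Let $\lambda_{\min}(C)$ be the minimum eigenvalue of the covariance matrix of the variables $X_i$. Then $F$, as given in Equation (5.5) with regularization parameter $\lambda$, is approximately submodular for all $\gamma \le \frac{\lambda_{\min}(C) + \lambda}{1 + \lambda}$ and $\delta = 0$. Since $\lambda_{\min}(C) \ge 0$, this threshold is at least $\frac{\lambda}{1 + \lambda}$.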


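Proof. From Definition 4.2.1, we need to show that the following holds:

$$\gamma \left[ F(A \cup S) - F(A) \right] - \delta \le \sum_{x \in S} \left[ F(A \cup \{x\}) - F(A) \right].$$

By Lemma 5.2.4, the left hand side can be simplified to

$$F(A \cup S) - F(A) = \frac{1}{2} b_S^{A'T} \left(C_S^{A'} + \lambda I\right)^{-1} b_S^{A'}.$$

We can do the same thing to the right hand side and get

$$\sum_{x \in S} F(A \cup \{x\}) - F(A) = \sum_{x \in S} \frac{1}{2} b_x^{A'T} \left(C_x^{A'} + \lambda I\right)^{-1} b_x^{A'} = \frac{1}{2} b_S^{A'T} \operatorname{diag}\left(C_S^{A'} + \lambda I\right)^{-1} b_S^{A'}.$$

We can lower bound the right hand side further using the variance bound on the variables $X_i$, giving $C_x \le 1$, and the fact that $C_{xA} (C_A + \lambda I)^{-1} C_{Ax} \ge 0$, to find

$$\frac{1}{2} b_S^{A'T} \operatorname{diag}\left(C_S^{A'} + \lambda I\right)^{-1} b_S^{A'} \ge \frac{1}{2} \frac{b_S^{A'T} b_S^{A'}}{1 + \lambda}.$$

Thus it suffices to find $\gamma$ and $\delta$ such that

$$\gamma \left( \frac{1}{2} b_S^{A'T} \left(C_S^{A'} + \lambda I\right)^{-1} b_S^{A'} \right) - \delta \le \frac{1}{2} \frac{b_S^{A'T} b_S^{A'}}{1 + \lambda}.$$

Using Lemma 5.2.3, we know that

$$\lambda_{\min}\left(C_S^{A'} + \lambda I\right) \left( b_S^{A'T} \left(C_S^{A'} + \lambda I\right)^{-1} b_S^{A'} \right) \le b_S^{A'T} b_S^{A'},$$

so the bound holds for

$$\gamma \le \frac{\lambda_{\min}\left(C_S^{A'} + \lambda I\right)}{1 + \lambda} = \frac{\lambda_{\min}\left(C_S^{A'}\right) + \lambda}{1 + \lambda}, \qquad \delta = 0.$$

Using Lemma 5.2.1 and Lemma 5.2.2, we know that

$$\lambda_{\min}\left(C_S^{A'}\right) \ge \lambda_{\min}\left(C_S^{A}\right) \ge \lambda_{\min}(C),$$

giving the final bound:

$$\gamma \le \frac{\lambda_{\min}(C) + \lambda}{1 + \lambda}, \qquad \text{where} \quad \frac{\lambda_{\min}(C) + \lambda}{1 + \lambda} \ge \frac{\lambda}{1 + \lambda}. \quad ■$$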

This result subsumes the previous result given by Das and Kempe [2011], and is identical for the unregularized, $\lambda = 0$ case. Additionally, it indicates that if the optimal weight vector is small and not substantially affected by strong regularization, the constants in the approximate submodularity bound are stronger than those given in the previous result, especially when the spectral bound $\lambda_{\min}(C)$ is small.
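To make the bound of Theorem 5.2.5 concrete, the sketch below enumerates every pair of disjoint sets $(A, S)$ on a small synthetic problem and records the smallest value of $\sum_{x \in S} [F(A \cup \{x\}) - F(A)] - \gamma [F(A \cup S) - F(A)]$ for $\gamma = \frac{\lambda_{\min}(C) + \lambda}{1 + \lambda}$. The data, the helper `F`, and the brute-force enumeration are illustrative assumptions; the check is only feasible for a handful of variables.

```python
import itertools
import numpy as np

def F(b, C, idx, lam):
    """Closed-form regularized objective 0.5 b_S^T (C_S + lam I)^{-1} b_S."""
    idx = list(idx)
    if not idx:
        return 0.0
    Q = C[np.ix_(idx, idx)] + lam * np.eye(len(idx))
    return 0.5 * b[idx] @ np.linalg.solve(Q, b[idx])

rng = np.random.default_rng(1)
n, d, lam = 400, 5, 0.5
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))  # correlated variables
X = (X - X.mean(0)) / X.std(0)                                 # zero mean, unit variance
y = rng.standard_normal(n)
C, b = X.T @ X / n, X.T @ y / n

# Multiplicative constant from the theorem (eigvalsh returns ascending eigenvalues).
gamma = (np.linalg.eigvalsh(C)[0] + lam) / (1 + lam)

worst_slack = np.inf
for r in range(d):
    for A in itertools.combinations(range(d), r):
        rest = [i for i in range(d) if i not in A]
        for k in range(1, len(rest) + 1):
            for S in itertools.combinations(rest, k):
                base = F(b, C, A, lam)
                joint = F(b, C, A + S, lam) - base
                single = sum(F(b, C, A + (x,), lam) - base for x in S)
                worst_slack = min(worst_slack, single - gamma * joint)
```

A non-negative `worst_slack` (up to rounding) is what the theorem predicts with $\delta = 0$.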

We can also derive a similar bound for the approximation error introduced by using the Orthogonal Matching Pursuit algorithm on this problem.


Theorem 5.2.6. Let variables $X_i$ be zero mean, unit variance random variables and $Y$ be a target variable. Let $\lambda_{\min}(C)$ be the minimum eigenvalue of the covariance matrix of the variables $X_i$. The OMP algorithm applied to $F$ as given in Equation (5.5) with regularization parameter $\lambda$ is $(\alpha, 0)$-approximately greedy as given in Definition 4.3.1, with $\alpha = \frac{\lambda_{\min}(C) + \lambda}{1 + \lambda}$.


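Proof. The OMP algorithm (Algorithm 5.2) at iteration $j + 1$ selects the element $x^*$ which maximizes

$$x^* = \arg\max_{x} \frac{E\left[(Y - w^{*T} X_{\mathcal{D}_j})^T x\right]^2}{c(x)},$$

where $w^*$ is the optimal regularized weight vector for the currently selected set $\mathcal{D}_j$:

$$w^* = \arg\min_{w} \frac{1}{2} E\left[(Y - w^T X_{\mathcal{D}_j})^2\right] + \frac{\lambda}{2} w^T w.$$

We can compute $w^*$ using the covariance vector $b$ and matrix $C$ directly to be

$$w^* = \left(C_{\mathcal{D}_j} + \lambda I\right)^{-1} b_{\mathcal{D}_j}.$$

So OMP selects the element which maximizes:

$$\begin{aligned}
\frac{E\left[\left(Y - \left((C_{\mathcal{D}_j} + \lambda I)^{-1} b_{\mathcal{D}_j}\right)^T X_{\mathcal{D}_j}\right) x\right]^2}{c(x)}
&= \frac{\left(b_x - \left((C_{\mathcal{D}_j} + \lambda I)^{-1} b_{\mathcal{D}_j}\right)^T E[X_{\mathcal{D}_j} x]\right)^2}{c(x)} \\
&= \frac{\left(b_x - \left((C_{\mathcal{D}_j} + \lambda I)^{-1} b_{\mathcal{D}_j}\right)^T C_{\mathcal{D}_j x}\right)^2}{c(x)} \\
&= \frac{b_x^{\mathcal{D}_j'T} b_x^{\mathcal{D}_j'}}{c(x)},
\end{aligned}$$

where

$$b_x^{\mathcal{D}_j'} = b_x - C_{x \mathcal{D}_j} \left(C_{\mathcal{D}_j} + \lambda I\right)^{-1} b_{\mathcal{D}_j}.$$

This implies that

$$\frac{b_{x^*}^{\mathcal{D}_j'T} b_{x^*}^{\mathcal{D}_j'}}{c(x^*)} \ge \max_{x} \frac{b_x^{\mathcal{D}_j'T} b_x^{\mathcal{D}_j'}}{c(x)}.$$

By Lemma 5.2.4, the gain for a single element $x$ at iteration $j + 1$ is

$$F(\mathcal{D}_j \cup \{x\}) - F(\mathcal{D}_j) = \frac{1}{2} b_x^{\mathcal{D}_j'T} \left(C_x^{\mathcal{D}_j'} + \lambda I\right)^{-1} b_x^{\mathcal{D}_j'}.$$

Using Lemma 5.2.1 through Lemma 5.2.3, in a similar argument to the previous proof, we find that, for all $x$

$$b_x^{\mathcal{D}_j'T} b_x^{\mathcal{D}_j'} \ge \left(\lambda_{\min}(C) + \lambda\right) b_x^{\mathcal{D}_j'T} \left(C_x^{\mathcal{D}_j'} + \lambda I\right)^{-1} b_x^{\mathcal{D}_j'}.$$

Using this, along with the fact that $C_x \le 1$, gives

$$\begin{aligned}
\frac{F(\mathcal{D}_j \cup \{x^*\}) - F(\mathcal{D}_j)}{c(x^*)}
&= \frac{1}{2} \frac{b_{x^*}^{\mathcal{D}_j'T} \left(C_{x^*}^{\mathcal{D}_j'} + \lambda I\right)^{-1} b_{x^*}^{\mathcal{D}_j'}}{c(x^*)} \\
&\ge \frac{1}{2} \frac{1}{1 + \lambda} \frac{b_{x^*}^{\mathcal{D}_j'T} b_{x^*}^{\mathcal{D}_j'}}{c(x^*)} \\
&\ge \frac{1}{2} \frac{1}{1 + \lambda} \max_{x} \frac{b_x^{\mathcal{D}_j'T} b_x^{\mathcal{D}_j'}}{c(x)} \\
&\ge \max_{x} \frac{1}{2} \frac{\lambda_{\min}(C) + \lambda}{1 + \lambda} \frac{b_x^{\mathcal{D}_j'T} \left(C_x^{\mathcal{D}_j'} + \lambda I\right)^{-1} b_x^{\mathcal{D}_j'}}{c(x)} \\
&= \max_{x} \frac{\lambda_{\min}(C) + \lambda}{1 + \lambda} \frac{F(\mathcal{D}_j \cup \{x\}) - F(\mathcal{D}_j)}{c(x)},
\end{aligned}$$

completing the proof. ■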


This result also subsumes the previous corresponding OMP result for the unregularized version of the problem [Das and Kempe, 2011]. Combined with the previous theorem and the submodular optimization analyses in Chapter 4 we can now directly derive approximation bounds for the regularized sparse approximation problem.
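The selection rule analyzed in Theorem 5.2.6 can be sketched directly on covariance statistics. The code below is a minimal illustration, not a transcription of Algorithm 5.2: the function names `F` and `omp_step`, the synthetic data, and the feature costs are all assumptions. At each step it picks the element maximizing $(b_x - C_{x\mathcal{D}} w^*)^2 / c(x)$ and records how the cost-normalized gain of that element compares with the best available cost-normalized gain.

```python
import numpy as np

def F(b, C, idx, lam):
    """Closed-form regularized objective 0.5 b_S^T (C_S + lam I)^{-1} b_S."""
    idx = list(idx)
    if not idx:
        return 0.0
    Q = C[np.ix_(idx, idx)] + lam * np.eye(len(idx))
    return 0.5 * b[idx] @ np.linalg.solve(Q, b[idx])

def omp_step(b, C, cost, lam, selected):
    """Return the unselected index maximizing (b_x - C_{xD} w*)^2 / c(x)."""
    resid = b.copy()
    if selected:
        Q = C[np.ix_(selected, selected)] + lam * np.eye(len(selected))
        w_star = np.linalg.solve(Q, b[selected])
        resid = b - C[:, selected] @ w_star
    score = resid ** 2 / cost
    if selected:
        score[selected] = -np.inf   # with lam > 0 the residual on D is not exactly zero
    return int(np.argmax(score))

rng = np.random.default_rng(2)
n, d, lam = 300, 8, 0.2
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))
X = (X - X.mean(0)) / X.std(0)
y = X @ rng.standard_normal(d)
C, b = X.T @ X / n, X.T @ y / n
cost = rng.uniform(1.0, 3.0, size=d)   # hypothetical per-feature costs
alpha = (np.linalg.eigvalsh(C)[0] + lam) / (1 + lam)

selected, ratios = [], []
for _ in range(4):
    x_star = omp_step(b, C, cost, lam, selected)
    base = F(b, C, selected, lam)
    rate = {x: (F(b, C, selected + [x], lam) - base) / cost[x]
            for x in range(d) if x not in selected}
    ratios.append(rate[x_star] / max(rate.values()))
    selected.append(x_star)
```

Each entry of `ratios` is the fraction of the best cost-normalized gain that OMP actually achieved at that step; the theorem predicts that every entry is at least `alpha`.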

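Corollary 5.2.7. Let $\gamma = \frac{\lambda_{\min}(C) + \lambda}{1 + \lambda}$. Let $F$ be the regularized sparse approximation objective given in Equation (5.5). Let $S = (s_1, \ldots)$ be any sequence. Let $\mathcal{D} = (d_1, \ldots)$ be the sequence selected by the greedy Forward Regression algorithm. Fix some $K > 0$. Let $B = \sum_{i=1}^{K} c(d_i)$. Then

$$F(\mathcal{D}_{\langle B \rangle}) \ge \left(1 - e^{-\gamma \frac{B}{C}}\right) F(S_{\langle C \rangle}).$$

Similarly let $\mathcal{D}' = (d'_1, \ldots)$ be the sequence selected by the Orthogonal Matching Pursuit algorithm and $B' = \sum_{i=1}^{K} c(d'_i)$. Then

$$F(\mathcal{D}'_{\langle B' \rangle}) \ge \left(1 - e^{-\gamma^2 \frac{B'}{C}}\right) F(S_{\langle C \rangle}).$$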
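To get a feel for how the constant $\gamma$ enters the guarantees in Corollary 5.2.7, the small sketch below evaluates the multiplicative factors $1 - e^{-\gamma B / C}$ (Forward Regression) and $1 - e^{-\gamma^2 B' / C}$ (OMP). The function name `greedy_guarantee` and the example numbers are assumptions chosen for illustration.

```python
import math

def greedy_guarantee(gamma, budget_used, budget_ref, omp=False):
    """Multiplicative factor 1 - exp(-rate * B / C), with rate = gamma^2 for OMP, gamma otherwise."""
    rate = gamma ** 2 if omp else gamma
    return 1.0 - math.exp(-rate * budget_used / budget_ref)

# Example: lambda_min(C) = 0.1 and lambda = 0.5 give gamma = 0.6 / 1.5 = 0.4.
gamma = (0.1 + 0.5) / (1 + 0.5)
forward_factor = greedy_guarantee(gamma, budget_used=2.0, budget_ref=1.0)
omp_factor = greedy_guarantee(gamma, budget_used=2.0, budget_ref=1.0, omp=True)
```

Because $\gamma \le 1$, the OMP factor is never larger than the Forward Regression factor at the same budget ratio, and both factors approach 1 as the greedy budget grows relative to the reference budget $C$.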


5.3  Constrained  Sparse  Approximation 

An alternative problem to consider that also addresses the concern of large weights is the Ivanov regularized, or constrained, variant of the sparse approximation problem. Under this approach we constrain the weight vectors applied to the selected subset to lie within some $\epsilon$-ball:

$$F(S) = \frac{1}{2} E[Y^2] - \min_{w \in \mathbb{R}^{|S|},\, \|w\|_2 \le \epsilon} \frac{1}{2} E\left[(Y - w^T X_S)^2\right] \qquad (5.7)$$

$$\phantom{F(S)} = \max_{w \in \mathbb{R}^{|S|},\, \|w\|_2 \le \epsilon} b_S^T w - \frac{1}{2} w^T C_S w. \qquad (5.8)$$


This constrained version of the problem allows for approximation guarantees in terms of a bound directly on the weight vector used in the optimal solution. By directly constraining the allowable weights, we restrict the impact that large weights could have on the approximate submodularity of the problem. Unlike the previous regularization approach, however, a constrained approach does not change the optimal solution to the problem, as long as the constraint is sufficiently large.
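The effect of the constraint can be illustrated with a small projected gradient ascent solver for the constrained objective $\max_{\|w\|_2 \le \epsilon} b^T w - \frac{1}{2} w^T C w$. This is a minimal sketch under stated assumptions: the solver `constrained_F`, its step size, and the two-variable example are illustrative choices and are not a method prescribed by the text.

```python
import numpy as np

def constrained_F(b, C, eps, iters=5000):
    """Approximate max_{||w||_2 <= eps} b^T w - 0.5 w^T C w by projected gradient ascent."""
    step = 1.0 / np.linalg.eigvalsh(C)[-1]   # 1 / largest eigenvalue of C
    w = np.zeros_like(b)
    for _ in range(iters):
        w = w + step * (b - C @ w)           # gradient step on the concave objective
        norm = np.linalg.norm(w)
        if norm > eps:
            w = w * (eps / norm)             # project back onto the eps-ball
    return b @ w - 0.5 * w @ C @ w, w

C = np.array([[1.0, 0.5], [0.5, 1.0]])
b = np.array([0.8, 0.1])

# The unconstrained optimum C^{-1} b = (1.0, -0.4) has norm about 1.08.
f_big, w_big = constrained_F(b, C, eps=2.0)      # constraint inactive
f_small, w_small = constrained_F(b, C, eps=0.5)  # constraint active
```

With $\epsilon = 2$ the constraint is inactive and the solver recovers the unconstrained solution; with $\epsilon = 0.5$ the solution lies on the boundary of the ball and the objective value is smaller.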

We can now detail the same approximate submodularity and approximately greedy bounds for the constrained case. We can first derive a similar result to Lemma 5.2.4 which simplifies the gain $F(A \cup S) - F(A)$ for the constrained problem.


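Lemma 5.3.1. For some variables $X_i$ and $Y$, let $F$ be defined as in Equation (5.7) with constraint $\epsilon$. Let $w^* = \arg\min_{\|w\|_2 \le \epsilon} E[(Y - w^T X_A)^2]$ be the weight vector which maximizes $F(A)$. Then

$$F(A \cup S) - F(A) = \max_{\|w + w^*_{A \cup S}\|_2 \le \epsilon} \left(b_{A \cup S} - C_{A \cup S} w^*_{A \cup S}\right)^T w - \frac{1}{2} w^T C_{A \cup S} w,$$

where

$$w^*_{A \cup S} = \begin{bmatrix} w^* \\ 0_S \end{bmatrix}.$$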


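Proof. Starting with the definition of $F$ from Equation (5.7), we have

$$F(A \cup S) - F(A) = \max_{\|w\|_2 \le \epsilon} \left[ b_{A \cup S}^T w - \frac{1}{2} w^T C_{A \cup S} w \right] - \max_{\|v\|_2 \le \epsilon} \left[ b_A^T v - \frac{1}{2} v^T C_A v \right].$$

Let $w^*_{A \cup S}$ be $w^*$ extended to the dimension of $A \cup S$ with zeros:

$$w^*_{A \cup S} = \begin{bmatrix} w^* \\ 0_S \end{bmatrix}.$$

Re-writing using the definition of $w^*$ gives

$$\begin{aligned}
F(A \cup S) - F(A)
&= \max_{\|w + w^*_{A \cup S}\|_2 \le \epsilon} \left[ b_{A \cup S}^T (w + w^*_{A \cup S}) - \frac{1}{2} (w + w^*_{A \cup S})^T C_{A \cup S} (w + w^*_{A \cup S}) \right] \\
&\quad - b_A^T w^* + \frac{1}{2} w^{*T} C_A w^*.
\end{aligned}$$

Expanding out terms and cancelling completes the proof:

$$\begin{aligned}
F(A \cup S) - F(A)
&= \max_{\|w + w^*_{A \cup S}\|_2 \le \epsilon} \left[ b_{A \cup S}^T w^*_{A \cup S} + b_{A \cup S}^T w - \frac{1}{2} w^T C_{A \cup S} w - w^{*T}_{A \cup S} C_{A \cup S} w - \frac{1}{2} w^{*T}_{A \cup S} C_{A \cup S} w^*_{A \cup S} \right] \\
&\quad - b_A^T w^* + \frac{1}{2} w^{*T} C_A w^* \\
&= \max_{\|w + w^*_{A \cup S}\|_2 \le \epsilon} \left[ b_A^T w^* + b_{A \cup S}^T w - \frac{1}{2} w^T C_{A \cup S} w - w^{*T}_{A \cup S} C_{A \cup S} w - \frac{1}{2} w^{*T} C_A w^* \right] \\
&\quad - b_A^T w^* + \frac{1}{2} w^{*T} C_A w^* \\
&= \max_{\|w + w^*_{A \cup S}\|_2 \le \epsilon} \left(b_{A \cup S} - C_{A \cup S} w^*_{A \cup S}\right)^T w - \frac{1}{2} w^T C_{A \cup S} w. \quad ■
\end{aligned}$$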


Unfortunately in the constrained setting there is no analytic closed-form solution, as there was in the previous case. The resulting expression for the gain is essentially a simplified maximization problem with a modified constraint.

Now, to derive the desired result we want to eventually bound the term for the combined gain in terms of the individual gains. Unfortunately a complete proof of this bound is still an open problem. We are able to prove the bound for a number of special cases.

The difficulty in getting a complete proof for this problem is the lack of a closed-form solution when dealing with the constrained version of the problem, in particular when analyzing the gain for a set $S$, $F(S \cup A) - F(A)$. In the analysis of the Tikhonov regularized and unregularized versions of the sparse approximation problem, the convenient closed-form solutions are used to show that this difference of two quadratic optimizations is equivalent to a single quadratic optimization over only the variables in $S$.

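Conjecture 5.3.2. Let variables $X_i$ be zero mean, unit variance random variables and $Y$ be a target variable, and $b$ and $C$ the appropriate covariance vector and matrix. Let $S$ and $A$ be sets of the variables $X_i$. Let $w^* = \arg\min_{\|w\|_2 \le \epsilon} E[(Y - w^T X_A)^2]$. Then, for all $\gamma \in [0, 1]$ and $\delta \ge \frac{\gamma^2 \epsilon^2}{2}$,

$$\begin{aligned}
&\gamma \left( \max_{\|w + w^*_{A \cup S}\|_2 \le \epsilon} \left(b_{A \cup S} - C_{A \cup S} w^*_{A \cup S}\right)^T w - \frac{1}{2} w^T C_{A \cup S} w \right) - \delta \\
&\qquad \le \sum_{x \in S} \max_{\|w + w^*_{A \cup \{x\}}\|_2 \le \epsilon} \left(b_{A \cup \{x\}} - C_{A \cup \{x\}} w^*_{A \cup \{x\}}\right)^T w - \frac{1}{2} w^T C_{A \cup \{x\}} w,
\end{aligned}$$

where

$$w^*_{A \cup S} = \begin{bmatrix} w^* \\ 0_S \end{bmatrix}.$$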

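Proof. This conjecture is the critical failure point in our current understanding of the constrained sparse approximation problem. We do not currently have a proof of the complete statement, but we can prove it for certain special cases.

Specifically, we now detail a proof of the case when $A = \emptyset$. From numerical experiments and other exploration of the problem, we believe that this case is the tightest case for the bound.

Eliminating the terms that depend on $A$, we can reduce the left and right hand side of the problem such that we need to find $\gamma$ and $\delta$ such that:

$$\gamma \left( \max_{\|w\|_2 \le \epsilon} b_S^T w - \frac{1}{2} w^T C_S w \right) - \delta \le \sum_{x \in S} \max_{|w| \le \epsilon} b_x w - \frac{1}{2} w C_x w.$$

Using the fact that $C_S$ is positive semi-definite and Cauchy-Schwarz we can upper bound the left hand side:

$$\gamma \left( \max_{\|w\|_2 \le \epsilon} b_S^T w - \frac{1}{2} w^T C_S w \right) - \delta \le \gamma \|b_S\|_2 \epsilon - \delta.$$

Similarly we can reduce the right hand side to a single optimization and bound it. Because the variables have unit variance, $C_x = 1$, so the separate one-dimensional problems combine into one problem over an $\infty$-norm ball, which contains the 2-norm ball of the same radius:

$$\sum_{x \in S} \max_{|w| \le \epsilon} b_x w - \frac{1}{2} w C_x w = \max_{\|w\|_\infty \le \epsilon} b_S^T w - \frac{1}{2} w^T w \ge \max_{\|w\|_2 \le \epsilon} b_S^T w - \frac{1}{2} w^T w.$$

Thus it suffices to find $\gamma$ and $\delta$ such that

$$\gamma \|b_S\|_2 \epsilon - \delta \le \max_{\|w\|_2 \le \epsilon} b_S^T w - \frac{1}{2} w^T w.$$

When $\|b_S\|_2 \ge \epsilon$, the right hand side is maximized at $w = \epsilon \frac{b_S}{\|b_S\|_2}$, and thus we want $\gamma$ and $\delta$ such that

$$\gamma \|b_S\|_2 \epsilon - \delta \le \|b_S\|_2 \epsilon - \frac{1}{2} \epsilon^2,$$

which holds true for any $\gamma \le 1$ and $\delta \ge (\gamma - \frac{1}{2}) \epsilon^2$.

When $\|b_S\|_2 \le \epsilon$, the right hand side is maximized at $w = b_S$, giving:

$$\gamma \|b_S\|_2 \epsilon - \delta \le \frac{1}{2} \|b_S\|_2^2,$$

which holds for any $\gamma \le 1$ and $\delta \ge \frac{\gamma^2 \epsilon^2}{2}$. Since $\frac{\gamma^2}{2} \ge (\gamma - \frac{1}{2})$ for $\gamma \le 1$, the second set of constraints satisfies both cases. ■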
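The two cases in the $A = \emptyset$ argument can be checked numerically on a grid. The sketch below is an illustration only; the function names `rhs` and `worst_violation` and the grid ranges are assumptions. It evaluates $\gamma \|b_S\|_2 \epsilon - \delta - \max_{\|w\|_2 \le \epsilon} (b_S^T w - \frac{1}{2} w^T w)$ with $\delta = \frac{\gamma^2 \epsilon^2}{2}$ over a range of $\gamma \in [0, 1]$ and of norms $\|b_S\|_2$, and reports the largest value found.

```python
def rhs(b_norm, eps):
    """max_{||w||_2 <= eps} b^T w - 0.5 w^T w, which depends on b only through ||b||_2."""
    if b_norm >= eps:
        return b_norm * eps - 0.5 * eps ** 2   # maximizer is eps * b / ||b||_2
    return 0.5 * b_norm ** 2                    # maximizer is w = b

def worst_violation(eps, steps=200):
    """Largest value of gamma*||b||*eps - delta - rhs over a grid, with delta = gamma^2 eps^2 / 2."""
    worst = float("-inf")
    for i in range(steps + 1):
        gamma = i / steps
        delta = 0.5 * gamma ** 2 * eps ** 2
        for j in range(steps + 1):
            b_norm = 3.0 * eps * j / steps
            worst = max(worst, gamma * b_norm * eps - delta - rhs(b_norm, eps))
    return worst
```

For example, `worst_violation(0.7)` should be zero up to rounding: the inequality is tight at $\|b_S\|_2 = \gamma \epsilon$ and strict elsewhere, matching the choice $\delta \ge \frac{\gamma^2 \epsilon^2}{2}$ in the argument above.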

Using this (conjectured) result, we can derive the corresponding approximate submodularity and approximately greedy bounds. For the remainder of this document, any bounds related to the constrained setting rely on the previous conjecture, and as a result are also conjectured results. We will also note this for each conjectured result.

Theorem 5.3.3 (a). Let variables $X_i$ be zero mean, unit variance random variables and $Y$ be a target variable. Then $F$, as defined in Equation (5.7) with constraint $\epsilon$, is approximately submodular for all $\gamma \in [0, 1]$ and $\delta \ge \frac{\gamma^2 \epsilon^2}{2}$.

(a) This theorem requires that Conjecture 5.3.2 hold in general.


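Proof. Let $w^* = \arg\min_{\|w\|_2 \le \epsilon} E[(Y - w^T X_A)^2]$, and

$$w^*_{A \cup S} = \begin{bmatrix} w^* \\ 0_S \end{bmatrix}.$$

By Lemma 5.3.1 the left hand side gain can be re-written as

$$F(A \cup S) - F(A) = \max_{\|w + w^*_{A \cup S}\|_2 \le \epsilon} \left(b_{A \cup S} - C_{A \cup S} w^*_{A \cup S}\right)^T w - \frac{1}{2} w^T C_{A \cup S} w.$$

Similarly, the right hand side gain can be re-written as

$$\sum_{x \in S} F(A \cup \{x\}) - F(A) = \sum_{x \in S} \max_{\|w + w^*_{A \cup \{x\}}\|_2 \le \epsilon} \left(b_{A \cup \{x\}} - C_{A \cup \{x\}} w^*_{A \cup \{x\}}\right)^T w - \frac{1}{2} w^T C_{A \cup \{x\}} w.$$

By Conjecture 5.3.2, the following holds:

$$\begin{aligned}
&\gamma \left( \max_{\|w + w^*_{A \cup S}\|_2 \le \epsilon} \left(b_{A \cup S} - C_{A \cup S} w^*_{A \cup S}\right)^T w - \frac{1}{2} w^T C_{A \cup S} w \right) - \delta \\
&\qquad \le \sum_{x \in S} \max_{\|w + w^*_{A \cup \{x\}}\|_2 \le \epsilon} \left(b_{A \cup \{x\}} - C_{A \cup \{x\}} w^*_{A \cup \{x\}}\right)^T w - \frac{1}{2} w^T C_{A \cup \{x\}} w,
\end{aligned}$$

which implies that the theorem holds for $\gamma \in [0, 1]$ and $\delta \ge \frac{\gamma^2 \epsilon^2}{2}$. ■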

In this particular case, the bound holds for any $\gamma \in [0, 1]$, with larger $\gamma$ improving the multiplicative approximation, but also weakening the additive bound.

The  matching  approximation  error  bound  for  the  OMP  algorithm  applied  to  the  constrained 
problem  is  also  an  open  problem,  but  analysis  of  special  cases  leads  us  to  believe  that,  as  in  the 
other  sparse  approximation  problems,  the  bound  matches  the  approximate  submodularity  bound. 

Theorem 5.3.4 (a). Let variables $X_i$ be zero mean, unit variance random variables and $Y$ be a target variable. Then the OMP algorithm applied to $F$, as defined in Equation (5.7) with constraint $\epsilon$, is $(\alpha, \beta)$-approximately greedy as given in Definition 4.3.1, for all $\alpha \in [0, 1]$ and $\beta \ge \frac{\alpha^2 \epsilon^2}{2}$.

(a) This theorem is based on Conjecture 5.3.2 ultimately being true.


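Proof. Let $w^* = \arg\min_{\|w\|_2 \le \epsilon} E[(Y - w^T X_{\mathcal{D}_j})^2]$ and

$$w^*_{\mathcal{D}_j \cup \{x\}} = \begin{bmatrix} w^* \\ 0_x \end{bmatrix}.$$

The OMP algorithm (Algorithm 5.2) at iteration $j + 1$ selects the element $x^*$ which maximizes

$$x^* = \arg\max_{x \in \mathcal{X}} \frac{E\left[(Y - w^{*T} X_{\mathcal{D}_j})^T x\right]^2}{c(x)}.$$

So OMP selects the element which maximizes:

$$\frac{E\left[(Y - w^{*T} X_{\mathcal{D}_j})^T x\right]^2}{c(x)} = \frac{\left(b_x - w^{*T} C_{\mathcal{D}_j x}\right)^2}{c(x)} = \frac{b_x^{\mathcal{D}_j'T} b_x^{\mathcal{D}_j'}}{c(x)},$$

where

$$b_x^{\mathcal{D}_j'} = b_x - C_{x \mathcal{D}_j} w^*.$$

This implies that

$$\frac{b_{x^*}^{\mathcal{D}_j'T} b_{x^*}^{\mathcal{D}_j'}}{c(x^*)} \ge \max_{x} \frac{b_x^{\mathcal{D}_j'T} b_x^{\mathcal{D}_j'}}{c(x)}.$$

By Lemma 5.3.1, the gain for a single element $x$ at iteration $j + 1$ is

$$F(\mathcal{D}_j \cup \{x\}) - F(\mathcal{D}_j) = \max_{\|w + w^*_{\mathcal{D}_j \cup \{x\}}\|_2 \le \epsilon} \left(b_{\mathcal{D}_j \cup \{x\}} - C_{\mathcal{D}_j \cup \{x\}} w^*_{\mathcal{D}_j \cup \{x\}}\right)^T w - \frac{1}{2} w^T C_{\mathcal{D}_j \cup \{x\}} w.$$

Note that the portion of the linear term $b_{\mathcal{D}_j \cup \{x\}} - C_{\mathcal{D}_j \cup \{x\}} w^*_{\mathcal{D}_j \cup \{x\}}$ which corresponds to $x$ is exactly the $b_x^{\mathcal{D}_j'}$ term maximized by OMP.

We hypothesize that the same argument which could prove Conjecture 5.3.2 should be able to prove the rest of this theorem as well. Namely, we hypothesize that we should be able to bound the gain of the selected element as some function of $b_{x^*}^{\mathcal{D}_j'T} b_{x^*}^{\mathcal{D}_j'}$. Then, using the OMP maximization criteria we can bound that in terms of $b_x^{\mathcal{D}_j'T} b_x^{\mathcal{D}_j'}$. Using a proof of Conjecture 5.3.2, we should be able to show that that function of $b_x^{\mathcal{D}_j'T} b_x^{\mathcal{D}_j'}$ is bounded by $\gamma \left(F(\mathcal{D}_j \cup \{x\}) - F(\mathcal{D}_j)\right)$, completing the proof. ■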


Assuming  that  these  conjectures  are  true  gives  a  set  of  corresponding  approximation  bounds 
for  greedy  approaches  to  the  constrained  sparse  approximation  problem. 

Corollary 5.3.5 (a). Let $\gamma \in [0, 1]$. Let $F$ be the constrained sparse approximation objective given in Equation (5.7) with constraint $\epsilon$. Let $S = (s_1, \ldots)$ be any sequence. Let $\mathcal{D} = (d_1, \ldots)$ be the sequence selected by the greedy Forward Regression algorithm. Fix some $K > 0$. Let $B = \sum_{i=1}^{K} c(d_i)$. Then

$$F(\mathcal{D}_{\langle B \rangle}) \ge \left(1 - e^{-\gamma \frac{B}{C}}\right) F(S_{\langle C \rangle}) - \frac{\gamma^2 \epsilon^2}{2} \frac{B}{C}.$$

Similarly let $\mathcal{D}' = (d'_1, \ldots)$ be the sequence selected by the Orthogonal Matching Pursuit algorithm and $B' = \sum_{i=1}^{K} c(d'_i)$. Then

$$F(\mathcal{D}'_{\langle B' \rangle}) \ge \left(1 - e^{-\gamma^2 \frac{B'}{C}}\right) F(S_{\langle C \rangle}) - \ldots$$

(a) This theorem is based on Conjecture 5.3.2 ultimately being true.

5.4 Generalization to Smooth Losses

The results in previous sections are all for the sparse approximation problem which directly optimizes the squared reconstruction error with respect to some target $Y$. In many domains we would like to use the same basic subset selection strategy, but with a different loss function to optimize. In this section we extend the previous results to arbitrary smooth losses, using the previous analysis as a starting point.

Recall that a loss function $\ell$ is $m$-strongly convex if for all $x, x'$

$$\ell(x') \ge \ell(x) + \langle \nabla \ell(x), x' - x \rangle + \frac{m}{2} \|x' - x\|^2 \qquad (5.9)$$

for some $m > 0$, and $M$-strongly smooth if

$$\ell(x') \le \ell(x) + \langle \nabla \ell(x), x' - x \rangle + \frac{M}{2} \|x' - x\|^2 \qquad (5.10)$$

for some $M > 0$.
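As a concrete instance of these two definitions, consider the scalar loss $\ell(z) = \log(1 + e^{-z}) + \frac{\mu}{2} z^2$, whose second derivative lies in $(\mu, \mu + \frac{1}{4}]$, so that it is $m$-strongly convex with $m = \mu$ and $M$-strongly smooth with $M = \mu + \frac{1}{4}$. The sketch below checks both quadratic bounds on a grid of points. The particular loss, the value of `MU`, and the helper names are assumptions chosen for illustration and do not appear in the original analysis.

```python
import math

MU = 0.1  # hypothetical amount of quadratic curvature added to the logistic loss

def loss(z):
    return math.log1p(math.exp(-z)) + 0.5 * MU * z * z

def grad(z):
    # derivative of log(1 + exp(-z)) is -1 / (1 + exp(z))
    return -1.0 / (1.0 + math.exp(z)) + MU * z

m, M = MU, MU + 0.25   # the second derivative lies in (MU, MU + 1/4]

def bounds_hold(points, tol=1e-12):
    """Check the strong convexity and strong smoothness inequalities for every pair of points."""
    for x in points:
        for xp in points:
            linear = loss(x) + grad(x) * (xp - x)
            quad = 0.5 * (xp - x) ** 2
            lower_ok = loss(xp) >= linear + m * quad - tol   # strong convexity bound
            upper_ok = loss(xp) <= linear + M * quad + tol   # strong smoothness bound
            if not (lower_ok and upper_ok):
                return False
    return True
```

Calling `bounds_hold([i / 4 for i in range(-20, 21)])` checks the two inequalities over the interval $[-5, 5]$.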

The sparse approximation for arbitrary losses simply replaces the squared error with a smooth loss $\ell$. The equivalent problem to the squared loss problem in coordinate space given in Equation (5.1) is

$$f(w) = E[\ell(w^T X_S)]. \qquad (5.11)$$

We can turn this into a monotonic, positive set function by subtracting that value from the starting loss, giving the equivalent set function for the smooth case:

$$F(S) = E[\ell(0)] - \min_{w \in \mathbb{R}^{|S|}} E[\ell(w^T X_S)]. \qquad (5.12)$$

To continue our analysis above, we can also generalize the Tikhonov regularized version of the problem. The function to optimize in coordinate space is

$$f(w) = E\left[\ell(w^T X_S) + \frac{\lambda}{2} w^T w\right], \qquad (5.13)$$

and the equivalent set function is

$$F(S) = E[\ell(0)] - \min_{w \in \mathbb{R}^{|S|}} E\left[\ell(w^T X_S) + \frac{\lambda}{2} w^T w\right], \qquad (5.14)$$

where the $E[\ell(0)]$ term is included to transform the problem into a positive set function maximization.

We can now analyze the approximate submodularity and approximate greedy behavior of the regularized, smooth loss version of the problem. First, we need to develop upper and lower bounds of the gain terms, using the strong smoothness and strong convexity of the loss $\ell$. Unlike the squared loss case, we don't get an exact expression for the gain terms, only quadratic upper and lower bounds.

We first give the lower bound for the gain terms, utilizing the strong smoothness of the loss.

Lemma 5.4.1. Let $F$ be as given in Equation (5.14) for $\ell$, an $M$-strongly smooth loss as in Equation (5.10). Let $w^* = \arg\min_{w} E[\ell(w^T X_A) + \frac{\lambda}{2} w^T w]$ be the weight vector which maximizes $F(A)$, and $Z = w^{*T} X_A$. Then

$$F(A \cup S) - F(A) \ge \frac{1}{2} b_S^{A'T} \left(M C_S^{A'} + \lambda I\right)^{-1} b_S^{A'},$$

where

$$b_S^{A'} = -E[\nabla \ell(Z) X_S]$$
$$C_S^{A'} = C_S - C_{SA} \left(C_A + \frac{\lambda}{M} I\right)^{-1} C_{AS}.$$


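Proof. Starting with the definition of $F$ from Equation (5.14), we have

$$F(A \cup S) - F(A) = \min_{w} E\left[\ell(w^T X_A) + \frac{\lambda}{2} w^T w\right] - \min_{w} E\left[\ell(w^T X_{A \cup S}) + \frac{\lambda}{2} w^T w\right].$$

Let $w^*_{A \cup S}$ be $w^*$ extended to the dimension of $A \cup S$ with zeros:

$$w^*_{A \cup S} = \begin{bmatrix} w^* \\ 0_S \end{bmatrix}.$$

Using $w^*$ and $Z$, and expanding using the strong smoothness requirement around $Z$:

$$\begin{aligned}
F(A \cup S) - F(A)
&= E\left[\ell(Z) + \frac{\lambda}{2} w^{*T} w^*\right] - \min_{w} E\left[\ell(Z + w^T X_{A \cup S}) + \frac{\lambda}{2} (w + w^*_{A \cup S})^T (w + w^*_{A \cup S})\right] \\
&\ge E\left[\ell(Z) + \frac{\lambda}{2} w^{*T} w^*\right] - \min_{w} E\left[\ell(Z) + \langle \nabla \ell(Z), w^T X_{A \cup S} \rangle + \frac{M}{2} \|w^T X_{A \cup S}\|^2 + \frac{\lambda}{2} (w + w^*_{A \cup S})^T (w + w^*_{A \cup S})\right] \\
&= \max_{w} E\left[-\langle \nabla \ell(Z), w^T X_{A \cup S} \rangle - \frac{M}{2} \|w^T X_{A \cup S}\|^2 - \frac{\lambda}{2} w^T w - \lambda w^T w^*_{A \cup S}\right] \\
&= \max_{w} E\left[\left(-\nabla \ell(Z) X_{A \cup S} - \lambda w^*_{A \cup S}\right)^T w\right] - \frac{1}{2} w^T \left(M C_{A \cup S} + \lambda I\right) w.
\end{aligned}$$

Let $b'_{A \cup S} = E[-\nabla \ell(Z) X_{A \cup S} - \lambda w^*_{A \cup S}]$. The astute reader will notice that this is effectively the negative gradient of the coordinate loss given in Equation (5.13) evaluated at $w^*_{A \cup S}$:

$$b'_{A \cup S} = -\nabla_{w^*_{A \cup S}} f(w^*_{A \cup S}),$$

or the gradient in weight-space of the function we are trying to maximize, at the current solution. By definition of $w^*$, this must be some vector:

$$b'_{A \cup S} = \begin{bmatrix} 0_A \\ b_S^{A'} \end{bmatrix},$$

which is 0 across all dimensions corresponding to $A$.

Because $b'_{A \cup S}$ is zero for all dimensions of $A$, we can simplify the gain by solving directly and using the formula for block matrix inversion on $M C_{A \cup S} + \lambda I$:

$$\begin{aligned}
F(A \cup S) - F(A)
&\ge \frac{1}{2} b'^T_{A \cup S} \left(M C_{A \cup S} + \lambda I\right)^{-1} b'_{A \cup S} \\
&= \frac{1}{2} b_S^{A'T} \left(M C_S^{A'} + \lambda I\right)^{-1} b_S^{A'},
\end{aligned}$$

where

$$C_S^{A'} = C_S - C_{SA} \left(C_A + \frac{\lambda}{M} I\right)^{-1} C_{AS}. \quad ■$$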
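The lower bound of Lemma 5.4.1 can be checked numerically on an empirical distribution. The sketch below is an illustration under several assumptions: it uses the logistic loss $\ell(z) = \log(1 + e^{-yz})$ with labels $y \in \{-1, +1\}$ (which is $\frac{1}{4}$-strongly smooth in $z$), takes expectations as sample averages, and minimizes the regularized objective with plain gradient descent. The function names and the synthetic data are hypothetical.

```python
import numpy as np

M = 0.25  # smoothness constant of the logistic loss log(1 + exp(-y z)) in z

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(X, y, w, lam):
    """Sample average of l(w^T x) plus lam/2 w^T w."""
    return np.mean(np.logaddexp(0.0, -y * (X @ w))) + 0.5 * lam * (w @ w)

def fit(X, y, lam, iters=2000):
    """Gradient descent with step 1/L, where L = M * lambda_max(C) + lam bounds the curvature."""
    n, d = X.shape
    L = M * np.linalg.eigvalsh(X.T @ X / n)[-1] + lam
    w = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (-y * sigmoid(-y * (X @ w))) / n + lam * w
        w -= grad / L
    return w

rng = np.random.default_rng(3)
n, lam = 400, 0.5
X = rng.standard_normal((n, 5)) @ rng.standard_normal((5, 5))
X = (X - X.mean(0)) / X.std(0)
y = np.where(X @ rng.standard_normal(5) + rng.standard_normal(n) > 0, 1.0, -1.0)
C = X.T @ X / n

A, S = [0, 1], [2, 3]
w_A = fit(X[:, A], y, lam)
w_AS = fit(X[:, A + S], y, lam)
gain = objective(X[:, A], y, w_A, lam) - objective(X[:, A + S], y, w_AS, lam)

Z = X[:, A] @ w_A
b_S = X[:, S].T @ (y * sigmoid(-y * Z)) / n          # -E[grad l(Z) X_S]
C_SA = C[np.ix_(S, A)]
C_res = C[np.ix_(S, S)] - C_SA @ np.linalg.solve(
    C[np.ix_(A, A)] + (lam / M) * np.eye(len(A)), C_SA.T)
bound = 0.5 * b_S @ np.linalg.solve(M * C_res + lam * np.eye(len(S)), b_S)
```

Here `gain` is $F(A \cup S) - F(A)$ computed from the two fitted objectives and `bound` is the right hand side of the lemma; the lemma predicts `gain >= bound` up to optimization tolerance.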


A similar argument can be used to show the corresponding strong convexity bound. We will omit the proof here because it is largely identical, except for the use of the strong convexity lower bound instead of the strong smoothness upper bound on the loss $\ell$.

Lemma 5.4.2. Let $F$ be as given in Equation (5.14) for $\ell$, an $m$-strongly convex loss as in Equation (5.9). Let $w^* = \arg\min_{w} E[\ell(w^T X_A) + \frac{\lambda}{2} w^T w]$ be the weight vector which maximizes $F(A)$, and $Z = w^{*T} X_A$. Then

$$F(A \cup S) - F(A) \le \frac{1}{2} b_S^{A'T} \left(m C_S^{A'} + \lambda I\right)^{-1} b_S^{A'},$$

where

$$b_S^{A'} = -E[\nabla \ell(Z) X_S]$$
$$C_S^{A'} = C_S - C_{SA} \left(C_A + \frac{\lambda}{m} I\right)^{-1} C_{AS}.$$


Using these bounds, we can bound the approximate submodularity of the smooth sparse approximation problem.

Theorem 5.4.3. Let variables $X_i$ be zero mean, unit variance random variables, and $\ell$ be an $m$-strongly convex and $M$-strongly smooth loss function as given in Equations (5.9-5.10). Let $\lambda_{\min}(C)$ be the minimum eigenvalue of the covariance matrix of the variables $X_i$. Then $F$, as given in Equation (5.14) with regularization parameter $\lambda$, is approximately submodular for all $\gamma \le \frac{m \lambda_{\min}(C) + \lambda}{M + \lambda}$ and $\delta = 0$. Since $m \lambda_{\min}(C) \ge 0$, this threshold is at least $\frac{\lambda}{M + \lambda}$.


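Proof. Recall that for approximate submodularity to hold we need to show that there exists $\gamma, \delta$ such that

$$\gamma \left(F(A \cup S) - F(A)\right) - \delta \le \sum_{x \in S} \left(F(A \cup \{x\}) - F(A)\right).$$

Let $b_S^{A'} = -E[\nabla \ell(Z) X_S]$. Using Lemma 5.4.1

$$F(A \cup \{x\}) - F(A) \ge \frac{1}{2} b_x^{A'T} \left(M C_x^{A'} + \lambda I\right)^{-1} b_x^{A'},$$

where

$$C_x^{A'} = C_x - C_{xA} \left(C_A + \frac{\lambda}{M} I\right)^{-1} C_{Ax}.$$

Similarly to the proof of Theorem 5.2.5, using $C_x = 1$ and $C_{xA} \left(C_A + \frac{\lambda}{M} I\right)^{-1} C_{Ax} \ge 0$, we can show that

$$b_x^{A'T} \left(M C_x^{A'} + \lambda I\right)^{-1} b_x^{A'} \ge \frac{b_x^{A'T} b_x^{A'}}{M + \lambda}.$$

Considering the sum over $S$ we have

$$\sum_{x \in S} F(A \cup \{x\}) - F(A) \ge \frac{1}{2} \frac{b_S^{A'T} b_S^{A'}}{M + \lambda}$$

for a bound on the right-hand side.

For the combined gain we have:

$$F(A \cup S) - F(A) \le \frac{1}{2} b_S^{A'T} \left(m C_S^{A'} + \lambda I\right)^{-1} b_S^{A'},$$

where

$$C_S^{A'} = C_S - C_{SA} \left(C_A + \frac{\lambda}{m} I\right)^{-1} C_{AS}.$$

By Lemma 5.2.3,

$$b_S^{A'T} \left(m C_S^{A'} + \lambda I\right)^{-1} b_S^{A'} \le \frac{b_S^{A'T} b_S^{A'}}{\lambda_{\min}\left(m C_S^{A'} + \lambda I\right)}.$$

We can now bound the $\lambda_{\min}$ term:

$$\lambda_{\min}\left(m C_S^{A'} + \lambda I\right) = m \lambda_{\min}\left(C_S^{A'}\right) + \lambda \ge m \lambda_{\min}\left(C_S^{A}\right) + \lambda \ge m \lambda_{\min}(C) + \lambda,$$

using the definition of $C_S^{A'}$ in the second step, Lemma 5.2.1 in the third step and Lemma 5.2.2 in the final step.

So now, setting $\gamma = \frac{m \lambda_{\min}(C) + \lambda}{M + \lambda}$ and $\delta = 0$ we have

$$\begin{aligned}
\frac{m \lambda_{\min}(C) + \lambda}{M + \lambda} \left(F(A \cup S) - F(A)\right)
&\le \frac{1}{2} \frac{m \lambda_{\min}(C) + \lambda}{M + \lambda} b_S^{A'T} \left(m C_S^{A'} + \lambda I\right)^{-1} b_S^{A'} \\
&\le \frac{1}{2} \frac{b_S^{A'T} b_S^{A'}}{M + \lambda} \\
&\le \sum_{x \in S} \left(F(A \cup \{x\}) - F(A)\right),
\end{aligned}$$

completing the proof. ■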


This result extends the squared loss case to all losses bounded by quadratics. One thing to note is that, in the case where $\lambda_{\min}(C)$ is small, the $m \lambda_{\min}(C)$ term contributes negligibly to the bound, and so we can drop the requirement that $\ell$ is strongly convex. Instead, in this case we only require that $\ell$ is convex (or equivalently, that $\ell$ is 0-strongly convex).

These bounds also similarly extend to the OMP algorithm applied to the smooth sparse approximation setting. In the smooth loss setting, the only change to the OMP algorithm given in Algorithm 5.2 is the gradient-based selection step. In the squared loss case, we implicitly stated the gradient for the squared loss in the selection criteria. Given a currently selected set $\mathcal{D}_{j-1}$, the smooth loss equivalent of the OMP algorithm is to first find $w^*$ using

$$w^* = \arg\min_{w} E\left[\ell(w^T X_{\mathcal{D}_{j-1}})\right].$$

The next element is then selected using

$$x^* = \arg\max_{x} \frac{E\left[-\nabla \ell(w^{*T} X_{\mathcal{D}_{j-1}})^T x\right]^2}{c(x)},$$

where $\nabla \ell$ is the gradient of the chosen smooth loss function.
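The gradient-based selection step can be sketched on samples as follows. This is a hedged illustration rather than the thesis implementation: it assumes the logistic loss with labels in $\{-1, +1\}$, adds a small Tikhonov term `lam` when fitting $w^*$ so that the fit stays bounded, replaces expectations with sample averages, and uses hypothetical names (`fit_logistic`, `smooth_omp_step`) and synthetic data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lam, iters=2000):
    """Gradient descent on mean(log(1 + exp(-y * Xw))) + lam/2 ||w||^2 with step 1/L."""
    n, d = X.shape
    L = 0.25 * np.linalg.eigvalsh(X.T @ X / n)[-1] + lam
    w = np.zeros(d)
    for _ in range(iters):
        w -= (X.T @ (-y * sigmoid(-y * (X @ w))) / n + lam * w) / L
    return w

def smooth_omp_step(X, y, selected, cost, lam):
    """Pick the unselected x maximizing E[-grad l(Z) x]^2 / c(x), with Z the current prediction."""
    if selected:
        w_star = fit_logistic(X[:, selected], y, lam)
        Z = X[:, selected] @ w_star
    else:
        Z = np.zeros(len(y))                 # empty model predicts zero
    neg_grad = y * sigmoid(-y * Z)           # -dl/dz evaluated per sample
    score = (X.T @ neg_grad / len(y)) ** 2 / cost
    if selected:
        score[selected] = -np.inf            # never reselect an element
    return int(np.argmax(score))

rng = np.random.default_rng(4)
X = rng.standard_normal((500, 6))
X = (X - X.mean(0)) / X.std(0)
y = np.where(2.0 * X[:, 3] + 0.1 * rng.standard_normal(500) > 0, 1.0, -1.0)
cost = np.ones(6)                            # hypothetical unit costs

first = smooth_omp_step(X, y, [], cost, lam=0.1)
second = smooth_omp_step(X, y, [first], cost, lam=0.1)
```

With unit costs the first selected element is the planted informative feature (index 3 in this synthetic example); raising the cost of that feature sufficiently makes the rule prefer a cheaper element instead, which is the cost-greedy trade-off the criterion encodes.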
We now extend the approximately greedy bound to this setting, getting the same multiplicative constant as the one derived in the approximate submodularity bound, as expected.

Theorem 5.4.4. Let variables $X_i$ be zero mean, and $\ell$ be an $m$-strongly convex and $M$-strongly smooth loss function as given in Equations (5.9-5.10). Let $\lambda_{\min}(C)$ be the minimum eigenvalue of the covariance matrix of the variables $X_i$. Then the OMP algorithm applied to $F$ as given in Equation (5.14) with regularization parameter $\lambda$ is approximately greedy for $\alpha \le \frac{m \lambda_{\min}(C) + \lambda}{M + \lambda}$ and $\beta = 0$. As before, this threshold is at least $\frac{\lambda}{M + \lambda}$.


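Proof. Like previous cases, the proof is similar to the proof of approximate submodularity in Theorem 5.4.3.

Let $w^* = \arg\min_{w} E[\ell(w^T X_{\mathcal{D}_j}) + \frac{\lambda}{2} w^T w]$ be the weight vector which maximizes $F(\mathcal{D}_j)$, and $Z = w^{*T} X_{\mathcal{D}_j}$.

Let $b_S^{\mathcal{D}_j'} = -E[\nabla \ell(Z) X_S]$. The OMP algorithm (Algorithm 5.2) at iteration $j + 1$ selects the element $x^*$ which maximizes

$$x^* = \arg\max_{x \in \mathcal{X}} \frac{E[-\nabla \ell(Z) x]^2}{c(x)},$$

which implies that

$$\frac{b_{x^*}^{\mathcal{D}_j'T} b_{x^*}^{\mathcal{D}_j'}}{c(x^*)} \ge \max_{x} \frac{b_x^{\mathcal{D}_j'T} b_x^{\mathcal{D}_j'}}{c(x)}.$$

Now, to lower bound the gain of the element $x^*$, using Lemma 5.4.1 and the same technique from the proof of Theorem 5.4.3:

$$F(\mathcal{D}_j \cup \{x^*\}) - F(\mathcal{D}_j) \ge \frac{1}{2} \frac{b_{x^*}^{\mathcal{D}_j'T} b_{x^*}^{\mathcal{D}_j'}}{M + \lambda}.$$

We can similarly upper bound the gain of any element $x$, and hence the maximum gain:

$$F(\mathcal{D}_j \cup \{x\}) - F(\mathcal{D}_j) \le \frac{1}{2} b_x^{\mathcal{D}_j'T} \left(m C_x^{\mathcal{D}_j'} + \lambda I\right)^{-1} b_x^{\mathcal{D}_j'},$$

where

$$C_x^{\mathcal{D}_j'} = C_x - C_{x \mathcal{D}_j} \left(C_{\mathcal{D}_j} + \frac{\lambda}{m} I\right)^{-1} C_{\mathcal{D}_j x}.$$

Using the same eigenvalue bounds as the previous proof:

$$b_x^{\mathcal{D}_j'T} \left(m C_x^{\mathcal{D}_j'} + \lambda I\right)^{-1} b_x^{\mathcal{D}_j'} \le \frac{b_x^{\mathcal{D}_j'T} b_x^{\mathcal{D}_j'}}{m \lambda_{\min}(C) + \lambda}.$$

So now, setting $\alpha = \frac{m \lambda_{\min}(C) + \lambda}{M + \lambda}$ and $\beta = 0$ we have

$$\begin{aligned}
\frac{F(\mathcal{D}_j \cup \{x^*\}) - F(\mathcal{D}_j)}{c(x^*)}
&\ge \frac{1}{2} \frac{1}{M + \lambda} \frac{b_{x^*}^{\mathcal{D}_j'T} b_{x^*}^{\mathcal{D}_j'}}{c(x^*)} \\
&\ge \max_{x} \frac{1}{2} \frac{1}{M + \lambda} \frac{b_x^{\mathcal{D}_j'T} b_x^{\mathcal{D}_j'}}{c(x)} \\
&\ge \max_{x} \frac{m \lambda_{\min}(C) + \lambda}{M + \lambda} \frac{F(\mathcal{D}_j \cup \{x\}) - F(\mathcal{D}_j)}{c(x)},
\end{aligned}$$

completing the proof. ■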


Just  as  in  the  previous  approximate  submodularity  proof,  we  can  drop  the  strong  convexity 
requirement  and  still  obtain  an  approximately  greedy  bound. 

The  previous  two  theorems  give  the  following  overall  approximation  results,  when  combined 
with  the  analysis  in  Chapter  4. 


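Corollary 5.4.5. Let $\ell$ be an $m$-strongly convex and $M$-strongly smooth loss function. Let $\gamma = \frac{m \lambda_{\min}(C) + \lambda}{M + \lambda}$. Let $F$ be the regularized, smooth sparse approximation objective given in Equation (5.14). Let $S = (s_1, \ldots)$ be any sequence. Let $\mathcal{D} = (d_1, \ldots)$ be the sequence selected by the greedy Forward Regression algorithm. Fix some $K > 0$. Let $B = \sum_{i=1}^{K} c(d_i)$. Then

$$F(\mathcal{D}_{\langle B \rangle}) \ge \left(1 - e^{-\gamma \frac{B}{C}}\right) F(S_{\langle C \rangle}).$$

Similarly let $\mathcal{D}' = (d'_1, \ldots)$ be the sequence selected by the Orthogonal Matching Pursuit algorithm and $B' = \sum_{i=1}^{K} c(d'_i)$. Then

$$F(\mathcal{D}'_{\langle B' \rangle}) \ge \left(1 - e^{-\gamma^2 \frac{B'}{C}}\right) F(S_{\langle C \rangle}).$$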


We can also analyze the same smooth loss version of the constrained, or Ivanov regularized, approach.

The smooth, constrained version of the problem, as a set function, is

$$F(S) = E[\ell(0)] - \min_{w \in \mathbb{R}^{|S|},\, \|w\|_2 \le \epsilon} E[\ell(w^T X_S)]. \qquad (5.15)$$

We also hypothesize that we should be able to derive similar bounds for the smooth, constrained case as in the constrained case for squared loss.

We can derive the corresponding lemmas that upper and lower bound the gain, as in the regularized smooth loss case.

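Lemma 5.4.6. Let $F$ be as given in Equation (5.15) for $\ell$, an $M$-strongly smooth loss as in Equation (5.10). Let $w^* = \arg\min_{\|w\|_2 \le \epsilon} E[\ell(w^T X_A)]$ be the weight vector which maximizes $F(A)$, and $Z = w^{*T} X_A$. Then

$$F(A \cup S) - F(A) \ge \max_{\|w + w^*_{A \cup S}\|_2 \le \epsilon} b'^T_{A \cup S} w - \frac{M}{2} w^T C_{A \cup S} w,$$

where

$$b'_{A \cup S} = -E[\nabla \ell(Z) X_{A \cup S}].$$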


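Proof. Starting with the definition of $F$ from Equation (5.15), we have

$$F(A \cup S) - F(A) = \min_{\|w\|_2 \le \epsilon} E[\ell(w^T X_A)] - \min_{\|w\|_2 \le \epsilon} E[\ell(w^T X_{A \cup S})].$$

Let $w^*_{A \cup S}$ be $w^*$ extended to the dimension of $A \cup S$ with zeros:

$$w^*_{A \cup S} = \begin{bmatrix} w^* \\ 0_S \end{bmatrix}.$$

Using $w^*$ and $Z$, and expanding using the strong smoothness requirement around $Z$:

$$\begin{aligned}
F(A \cup S) - F(A)
&= E[\ell(Z)] - \min_{\|w + w^*_{A \cup S}\|_2 \le \epsilon} E[\ell(Z + w^T X_{A \cup S})] \\
&\ge E[\ell(Z)] - \min_{\|w + w^*_{A \cup S}\|_2 \le \epsilon} E\left[\ell(Z) + \langle \nabla \ell(Z), w^T X_{A \cup S} \rangle + \frac{M}{2} \|w^T X_{A \cup S}\|^2\right] \\
&= \max_{\|w + w^*_{A \cup S}\|_2 \le \epsilon} E\left[-\langle \nabla \ell(Z), w^T X_{A \cup S} \rangle - \frac{M}{2} \|w^T X_{A \cup S}\|^2\right] \\
&= \max_{\|w + w^*_{A \cup S}\|_2 \le \epsilon} E\left[(-\nabla \ell(Z) X_{A \cup S})^T w\right] - \frac{M}{2} w^T C_{A \cup S} w.
\end{aligned}$$

Let $b'_{A \cup S} = E[-\nabla \ell(Z) X_{A \cup S}]$.

This completes the proof.

Just as in the regularized case, this is effectively the negative gradient of the coordinate loss given in Equation (5.11) evaluated at the current optimal weight vector $w^*_{A \cup S}$:

$$b'_{A \cup S} = -\nabla_{w^*_{A \cup S}} f(w^*_{A \cup S}),$$

or the gradient in weight-space of the function we are trying to maximize, at the current solution. ■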

Using the same argument, but expanding using the definition of strong convexity yields the corresponding upper bound.

Lemma 5.4.7. Let $F$ be as given in Equation (5.15) for $\ell$, an $m$-strongly convex loss as in Equation (5.9). Let $w^* = \arg\min_{\|w\|_2 \le \epsilon} E[\ell(w^T X_A)]$ be the weight vector which maximizes $F(A)$, and $Z = w^{*T} X_A$. Then

$$F(A \cup S) - F(A) \le \max_{\|w + w^*_{A \cup S}\|_2 \le \epsilon} b'^T_{A \cup S} w - \frac{m}{2} w^T C_{A \cup S} w,$$

where

$$b'_{A \cup S} = -E[\nabla \ell(Z) X_{A \cup S}].$$


Using these bounds, and the conjectures in Section 5.3, we also hypothesize that we should be able to derive corresponding approximate submodularity and approximately greedy bounds for the smooth, constrained case. We will not present proofs here, but just state the conjectured results. The proofs for the case when the existing set $A = \emptyset$ are relatively straightforward, as in the partial proof of Conjecture 5.3.2.


Theorem 5.4.8 (a). Let variables $X_i$ be zero mean, unit variance random variables, and $\ell$ be an $M$-strongly smooth loss function as given in Equation (5.10), with $M \ge 1$. Then $F$, as given in Equation (5.15) with constraint $\epsilon$, is approximately submodular for all $\gamma \le 1$ and all $\delta \ge \frac{M \gamma^2 \epsilon^2}{2}$.

(a) This theorem requires that Conjecture 5.3.2 hold in general.

Theorem 5.4.9 (a). Let variables $X_i$ be zero mean, unit variance random variables, and $\ell$ an $M$-strongly smooth loss function as given in Equation (5.10), with $M \ge 1$. Then the OMP algorithm applied to $F$ as given in Equation (5.15) with constraint $\epsilon$ is approximately greedy for $\alpha \le 1$ and all $\beta \ge \frac{M \alpha^2 \epsilon^2}{2}$.

(a) This theorem requires that Conjecture 5.3.2 hold in general.

We  can  now  combine  these  (conjectured)  results  with  the  corresponding  greedy  optimizations 
from  the  previous  bound  and  achieve  the  following  (conjectured)  approximation  bounds. 

Corollary 5.4.10 (a). Let $\ell$ be an $M$-strongly smooth loss function. Let $\gamma \in [0, 1]$. Let $F$ be the constrained, smooth sparse approximation objective given in Equation (5.15) with constraint $\epsilon$. Let $S = (s_1, \ldots)$ be any sequence. Let $\mathcal{D} = (d_1, \ldots)$ be the sequence selected by the greedy Forward Regression algorithm. Fix some $K > 0$. Let $B = \sum_{i=1}^{K} c(d_i)$. Then

B  My2e2  B 

F(V{b))  >  (1  -  e~^)F(S{c))  - 

Similarly let $\mathcal{D}' = (d'_1, \ldots)$ be the sequence selected by the Orthogonal Matching Pursuit algorithm and $B' = \sum_{i=1}^{K} c(d'_i)$. Then

F(V'{b,})  >  (1  -  e-T2#)F(S(c)  - - 

"This  theorem  requires  that  Conjecture  5.3.2  hold  in  general. 


5.5  Simultaneous  Sparse  Approximation 

We will now examine a common extension of the sparse approximation problem, the simultaneous sparse approximation problem. In this problem we want to select a subset of the variables $X_i$ that best reconstruct multiple target signals $Y_k$ simultaneously or optimize multiple smooth losses simultaneously. In this setting the same set of variables is selected to reconstruct all signals, but the linear combination of the variables selected is allowed to vary arbitrarily for each signal. More formally, given some set of problems $F_k$, the objective for this problem is just the sum of these set functions. For example, in the smooth, regularized case from Equation (5.14), the simultaneous version of the loss is

$$F(S) = \sum_k F_k(S) \tag{5.16}$$

$$= \sum_k \left( E[\ell_k(0)] - \min_{w \in \mathbb{R}^{|S|}} E\left[\ell_k(w^T X_S) + \frac{\lambda}{2} w^T w\right] \right). \tag{5.17}$$

One example of a problem that fits in this setting is the multiclass setting where a one-vs-all approach is used in combination with a smooth loss. Each of the resulting smooth loss problems corresponds to one of the set functions $F_k$. Other examples include reconstructing multiple targets $Y_k$ in multi-output regression, or other settings where we want to select the same subset of features to simultaneously minimize multiple loss functions.

The corresponding Forward Regression algorithm for the simultaneous setting is very straightforward. We simply select the element that maximizes the per-unit-cost gain of the complete objective:

$$\frac{F(A \cup \{x\}) - F(A)}{c(x)}.$$

This is equivalent to summing the gains for each individual sparse approximation problem and then dividing by the cost of the element. It is relatively easy to show that this simultaneous Forward Regression algorithm has the same guarantees as in the individual setting.
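The selection rule above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in this work: the set functions are toy coverage objectives instead of regression objectives, the names `simultaneous_forward_selection` and `coverage` are hypothetical, and stopping once no remaining element fits in a fixed budget is an added assumption.

```python
from typing import Callable, Dict, FrozenSet, Hashable, List, Sequence

SetFunction = Callable[[FrozenSet[Hashable]], float]


def simultaneous_forward_selection(
    elements: Sequence[Hashable],
    objectives: Sequence[SetFunction],  # hypothetical F_1, ..., F_K
    cost: Dict[Hashable, float],        # hypothetical c(x) > 0
    budget: float,                      # assumption: stop when nothing else fits
) -> List[Hashable]:
    """Greedily add the element with the largest summed gain per unit cost."""
    selected: List[Hashable] = []
    spent = 0.0
    while True:
        current = frozenset(selected)
        base = [F(current) for F in objectives]
        best, best_ratio = None, 0.0
        for x in elements:
            if x in current or spent + cost[x] > budget:
                continue
            # Gain of F = sum_k F_k is the sum of the per-problem gains.
            gain = sum(F(current | {x}) - b for F, b in zip(objectives, base))
            ratio = gain / cost[x]
            if ratio > best_ratio:
                best, best_ratio = x, ratio
        if best is None:
            return selected
        selected.append(best)
        spent += cost[best]


def coverage(sets: Dict[Hashable, set]) -> SetFunction:
    """Toy monotone objective: number of items covered by the chosen elements."""
    return lambda S: float(len(set().union(*(sets[x] for x in S)))) if S else 0.0


F1 = coverage({"a": {1, 2}, "b": {2, 3}, "c": {4}})
F2 = coverage({"a": {1}, "b": {5, 6, 7}, "c": {8}})
order = simultaneous_forward_selection(
    ["a", "b", "c"], [F1, F2], {"a": 1.0, "b": 2.0, "c": 1.0}, budget=3.0
)
print(order)
```

Element "b" has the largest total gain in the first round, but "a" is chosen first because its gain per unit cost is higher.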

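Theorem 5.5.1. For $k = 1, \ldots, K$, let $F_k(S)$ be $(\gamma_k, \delta_k)$-approximately submodular. Let $F(S) = \sum_{k=1}^{K} F_k(S)$. Let $\gamma \le \gamma_k$ for all $k$ and $\delta = \sum_{k=1}^{K} \delta_k$. Then $F(S)$ is $(\gamma, \delta)$-approximately submodular.

Proof. By the definition of approximate submodularity we know that, for $k = 1, \ldots, K$,

$$\gamma_k \left( F_k(A \cup S) - F_k(A) \right) - \delta_k \le \sum_{x \in S} \left[ F_k(A \cup \{x\}) - F_k(A) \right].$$

Summing over $k$ we have that

$$\sum_{k=1}^{K} \left[ \gamma_k \left( F_k(A \cup S) - F_k(A) \right) - \delta_k \right] \le \sum_{k=1}^{K} \sum_{x \in S} \left[ F_k(A \cup \{x\}) - F_k(A) \right].$$

Now using $\gamma \le \gamma_k$ and $\delta = \sum_{k=1}^{K} \delta_k$ we have

$$\begin{aligned}
\gamma \left( F(A \cup S) - F(A) \right) - \delta
&= \gamma \left( \sum_{k=1}^{K} \left[ F_k(A \cup S) - F_k(A) \right] \right) - \delta \\
&\le \sum_{k=1}^{K} \left[ \gamma_k \left( F_k(A \cup S) - F_k(A) \right) - \delta_k \right] \\
&\le \sum_{k=1}^{K} \sum_{x \in S} \left[ F_k(A \cup \{x\}) - F_k(A) \right] \\
&= \sum_{x \in S} \left[ F(A \cup \{x\}) - F(A) \right].
\end{aligned}$$

This result shows that a simultaneous sparse approximation problem is approximately submodular with multiplicative constant equal to the loosest of those of the corresponding individual sparse approximation problems, and additive constant equal to the sum of the individual additive constants.

Algorithm 5.3 Simultaneous Orthogonal Matching Pursuit

Given: elements $\mathcal{X}$, targets $Y_k$
Define $F$ as in Equation (5.3)
Let $\mathcal{D}_0 = \emptyset$.
for $j = 1, \ldots$ do
    for all $k$ do
        Let $w_k^* = \arg\min_w E[(Y_k - w^T X_{\mathcal{D}_{j-1}})^2]$
    end for
    Let $x^* = \arg\max_{x \in \mathcal{X}} \sum_k \frac{E[(Y_k - w_k^{*T} X_{\mathcal{D}_{j-1}})^T x]^2}{c(x)}$
    Let $\mathcal{D}_j = \mathcal{D}_{j-1} \cup \{x^*\}$.
end for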

The adaptation of the OMP algorithm is also intuitive. Whereas before the OMP algorithm selected the element which maximized the squared gradient

$$x^* = \arg\max_{x \in \mathcal{X}} \frac{E[\nabla^T x]^2}{c(x)},$$

we now select the element which maximizes the sum of squared gradients with respect to each individual sparse approximation problem

$$x^* = \arg\max_{x \in \mathcal{X}} \sum_k \frac{E[\nabla_k^T x]^2}{c(x)}.$$

A complete version of the Simultaneous OMP [Cotter et al., 2005, Chen and Huo, 2006] algorithm is given in Algorithm 5.3, for the squared loss sparse approximation problem. The equivalent selection criterion for the smooth loss case is

$$x^* = \arg\max_{x \in \mathcal{X}} \sum_k \frac{E[-\nabla \ell_k(w_k^{*T} X_{\mathcal{D}_{j-1}})^T x]^2}{c(x)}.$$
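For finite samples, the squared-loss version in Algorithm 5.3 can be sketched as below. This is an illustrative sketch under stated assumptions, not the code used for the experiments: expectations are replaced by sample averages, features are stored as columns, the helper names (`solve`, `residual`, `somp`) are hypothetical, and the selected columns are assumed to be linearly independent so that the normal equations can be solved.

```python
from typing import List, Sequence

Vector = List[float]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def solve(A: List[Vector], b: Vector) -> Vector:
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (M[i][n] - sum(M[i][j] * w[j] for j in range(i + 1, n))) / M[i][i]
    return w


def residual(columns: List[Vector], y: Vector) -> Vector:
    """Least-squares residual of y after regressing on the given columns."""
    if not columns:
        return y[:]
    gram = [[dot(a, b) for b in columns] for a in columns]
    w = solve(gram, [dot(a, y) for a in columns])
    return [yi - sum(wj * col[i] for wj, col in zip(w, columns)) for i, yi in enumerate(y)]


def somp(features: List[Vector], targets: List[Vector], cost: List[float], steps: int) -> List[int]:
    """Pick `steps` features by summed squared residual correlation per unit cost."""
    selected: List[int] = []
    n = len(targets[0])
    for _ in range(steps):
        residuals = [residual([features[j] for j in selected], y) for y in targets]

        def score(j: int) -> float:
            return sum((dot(r, features[j]) / n) ** 2 for r in residuals) / cost[j]

        candidates = [j for j in range(len(features)) if j not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Three orthogonal, zero mean, unit variance features and two targets.
x0, x1, x2 = [1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0], [1.0, -1.0, -1.0, 1.0]
y1 = [2.0 * a + 0.5 * b for a, b in zip(x0, x1)]
y2 = [a + 0.25 * c for a, c in zip(x0, x2)]
uniform = somp([x0, x1, x2], [y1, y2], [1.0, 1.0, 1.0], steps=2)
costly = somp([x0, x1, x2], [y1, y2], [1.0, 8.0, 1.0], steps=2)
```

With uniform costs the second pick is the feature that helps the first target most; making that feature eight times more expensive switches the second pick to the cheaper feature.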


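In a manner similar to the OMP approximation results previously shown, we can derive similar bounds for the SOMP algorithm applied to those settings. The analysis is not as straightforward as the analysis of the Forward Regression approach, and must be done individually for each different sparse approximation setting. We present here the smooth, regularized version of the proof.

Theorem 5.5.2. For $k = 1, \ldots, K$, let $F_k(S)$ be the regularized sparse approximation problem for an $m$-strongly convex and $M$-strongly smooth loss $\ell_k$, and regularization parameter $\lambda$ as in Equation (5.14). Let $F(S) = \sum_{k=1}^{K} F_k(S)$. Then the SOMP algorithm applied to $F(S)$ is $(\alpha, 0)$-approximately greedy, for

$$\alpha = \frac{m \lambda_{\min}(C) + \lambda}{M + \lambda}.$$

Proof. This proof is very similar to the proof for regular OMP applied to the regularized, smooth-loss version of the sparse approximation problem in Theorem 5.4.4.

Let $w_k^* = \arg\min_w E[\ell_k(w^T X_{\mathcal{D}_j}) + \frac{\lambda}{2} w^T w]$ be the weight vector which maximizes $F_k(\mathcal{D}_j)$, and $Z_k = w_k^{*T} X_{\mathcal{D}_j}$.

Let $b_{S,k}^{\mathcal{D}_j} = -E[\nabla \ell_k(Z_k) X_S]$. The OMP algorithm (Algorithm 5.2) at iteration $j + 1$ selects the element $x^*$ which maximizes

$$x^* = \arg\max_{x \in \mathcal{X}} \sum_{k=1}^{K} \frac{E[\nabla \ell_k(Z_k) x]^2}{c(x)},$$

which implies that

$$\sum_{k=1}^{K} \frac{(b_{x^*,k}^{\mathcal{D}_j})^T b_{x^*,k}^{\mathcal{D}_j}}{c(x^*)} \ge \max_{x} \sum_{k=1}^{K} \frac{(b_{x,k}^{\mathcal{D}_j})^T b_{x,k}^{\mathcal{D}_j}}{c(x)}.$$

Now, to lower bound the gain of the element $x^*$, using Lemma 5.4.1 and the same technique from the proof of Theorem 5.4.4:

$$\sum_{k=1}^{K} \left[ F_k(\mathcal{D}_j \cup \{x^*\}) - F_k(\mathcal{D}_j) \right] \ge \sum_{k=1}^{K} \frac{1}{2} \frac{(b_{x^*,k}^{\mathcal{D}_j})^T b_{x^*,k}^{\mathcal{D}_j}}{M + \lambda}.$$

We can similarly upper bound the gain of an arbitrary element:

$$\sum_{k=1}^{K} \left[ F_k(\mathcal{D}_j \cup \{x\}) - F_k(\mathcal{D}_j) \right] \le \sum_{k=1}^{K} \frac{1}{2} (b_{x,k}^{\mathcal{D}_j})^T \left( m C_x^{\mathcal{D}_j} + \lambda I \right)^{-1} b_{x,k}^{\mathcal{D}_j},$$

where

$$C_S^{A} = C_S - C_{SA} \left( C_A + \frac{\lambda}{m} I \right)^{-1} C_{AS}.$$

Again using the eigenvalue bound from Lemmas 5.2.1-5.2.3, we know that

$$(b_{x,k}^{\mathcal{D}_j})^T \left( m C_x^{\mathcal{D}_j} + \lambda I \right)^{-1} b_{x,k}^{\mathcal{D}_j} \le \frac{1}{m \lambda_{\min}(C) + \lambda} (b_{x,k}^{\mathcal{D}_j})^T b_{x,k}^{\mathcal{D}_j}.$$

So now, setting $\alpha = \frac{m \lambda_{\min}(C) + \lambda}{M + \lambda}$ and $\beta = 0$ we have

$$\begin{aligned}
\frac{F(\mathcal{D}_j \cup \{x^*\}) - F(\mathcal{D}_j)}{c(x^*)}
&= \sum_{k=1}^{K} \frac{F_k(\mathcal{D}_j \cup \{x^*\}) - F_k(\mathcal{D}_j)}{c(x^*)} \\
&\ge \sum_{k=1}^{K} \frac{1}{2} \frac{1}{M + \lambda} \frac{(b_{x^*,k}^{\mathcal{D}_j})^T b_{x^*,k}^{\mathcal{D}_j}}{c(x^*)} \\
&\ge \max_{x} \sum_{k=1}^{K} \frac{1}{2} \frac{1}{M + \lambda} \frac{(b_{x,k}^{\mathcal{D}_j})^T b_{x,k}^{\mathcal{D}_j}}{c(x)} \\
&\ge \max_{x} \frac{m \lambda_{\min}(C) + \lambda}{M + \lambda} \sum_{k=1}^{K} \frac{F_k(\mathcal{D}_j \cup \{x\}) - F_k(\mathcal{D}_j)}{c(x)} \\
&= \max_{x} \frac{m \lambda_{\min}(C) + \lambda}{M + \lambda} \frac{F(\mathcal{D}_j \cup \{x\}) - F(\mathcal{D}_j)}{c(x)},
\end{aligned}$$

completing the proof. □

These two results together show that, for a given set of sparse approximation problems with identical approximation guarantees, such as a set of problems all with the same regularization parameters and variables, the resulting simultaneous sparse approximation problem has identical approximation guarantees to each individual problem as well.

Algorithm 5.4 Grouped Forward Regression

Given: elements $\mathcal{X}$, groups $\Gamma$, target $Y$
Define $F$ as in Equation (5.3)
Let $\mathcal{D}_0 = \emptyset$.
for $j = 1, \ldots$ do
    Let $\mathcal{G}^* = \arg\max_{\mathcal{G} \in \Gamma} \frac{F(\mathcal{D}_{j-1} \cup \mathcal{G}) - F(\mathcal{D}_{j-1})}{c(\mathcal{G})}$
    Let $\mathcal{D}_j = \mathcal{D}_{j-1} \cup \mathcal{G}^*$.
end for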


5.6  Grouped  Features 

One final extension of the standard sparse approximation problem that we will examine is the grouped version of the problem. In this variant, each feature belongs to some group $\mathcal{G}$, and the budget and costs $c(\mathcal{G})$ are defined over the groups selected, not the individual features. Selecting any one feature within the group is effectively equivalent to selecting the entire group. This scenario typically arises when features are computed using some common process or derived from some base feature.

Algorithm 5.5 Grouped Orthogonal Matching Pursuit

Given: elements $\mathcal{X}$, groups $\Gamma$, target $Y$
Define $F$ as in Equation (5.3)
Let $\mathcal{D}_0 = \emptyset$.
for $j = 1, \ldots$ do
    Let $w^* = \arg\min_w E[(Y - w^T X_{\mathcal{D}_{j-1}})^2]$
    Let $b_x^{\mathcal{D}_{j-1}} = E[(Y - w^{*T} X_{\mathcal{D}_{j-1}})^T x]$.
    Let $\mathcal{G}^* = \arg\max_{\mathcal{G} \in \Gamma} \frac{(b_{\mathcal{G}}^{\mathcal{D}_{j-1}})^T (C_{\mathcal{G}} + \lambda I)^{-1} b_{\mathcal{G}}^{\mathcal{D}_{j-1}}}{c(\mathcal{G})}$
    Let $\mathcal{D}_j = \mathcal{D}_{j-1} \cup \mathcal{G}^*$.
end for

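Formally, we are given some set of groups $\Gamma = \{\mathcal{G}_1, \mathcal{G}_2, \ldots\}$ such that each group contains some set of the features $\mathcal{G} \subseteq \mathcal{X}$. We additionally assume that the sets form a partition of the set $\mathcal{X}$, so that $\mathcal{G} \cap \mathcal{G}' = \emptyset$ for all $\mathcal{G}, \mathcal{G}' \in \Gamma$ with $\mathcal{G} \ne \mathcal{G}'$.

Let $F'$ denote the set function for the grouped maximization problem, or equivalently, the corresponding set function over the raw variables in the selected groups:

$$F'(\Sigma) = F(S(\Sigma)) \tag{5.18}$$

$$S(\Sigma) = \bigcup_{\mathcal{G} \in \Sigma} \mathcal{G}. \tag{5.19}$$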
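The grouped objective in Equations (5.18) and (5.19) composes directly with a greedy loop over groups, in the spirit of Algorithm 5.4. The sketch below is only illustrative: the function name `grouped_forward_selection` is hypothetical, and the set function is a toy additive objective (a made-up per-feature weight) rather than a regression objective.

```python
from typing import Callable, Dict, FrozenSet, List


def grouped_forward_selection(
    groups: Dict[str, FrozenSet[str]],        # group name -> features in the group
    F: Callable[[FrozenSet[str]], float],     # set function over raw features
    cost: Dict[str, float],                   # hypothetical c(G) > 0 per group
    steps: int,
) -> List[str]:
    """Greedily pick groups by gain of F'(Sigma) = F(union of groups) per unit cost."""
    chosen: List[str] = []
    features: FrozenSet[str] = frozenset()
    for _ in range(steps):
        remaining = [g for g in groups if g not in chosen]
        if not remaining:
            break
        base = F(features)
        best = max(remaining, key=lambda g: (F(features | groups[g]) - base) / cost[g])
        chosen.append(best)
        features = features | groups[best]
    return chosen


# Toy example: group A holds one strong feature, group B three slightly weaker ones.
weights = {"a1": 3.0, "b1": 2.0, "b2": 2.0, "b3": 2.0}
groups = {"A": frozenset({"a1"}), "B": frozenset({"b1", "b2", "b3"})}
F = lambda S: sum(weights[x] for x in S)
order = grouped_forward_selection(groups, F, {"A": 1.0, "B": 1.0}, steps=2)
print(order)
```

Because gains are measured over whole groups, group B is selected first even though the single best feature lives in group A.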

One typical solution to this problem in practice is to use the standard Forward Regression and Orthogonal Matching Pursuit algorithms adapted to the group setting. In this approach the same greedy criterion over single features is used, but the entire group corresponding to the selected feature is used as the selected group. Effectively this approach greedily selects groups by using the max, or $L_\infty$ norm, of some criterion over the features in the group. Another obvious variant of the OMP algorithm is to use the $L_2$ norm of the OMP criterion evaluated over the features in the group. To be concrete, the standard OMP criterion maximizes the gradient term:

$$b_x^{\mathcal{D}_{j-1}} = E[(Y - w^{*T} X_{\mathcal{D}_{j-1}})^T x],$$

specifically  the  squared  term: 


The single feature version of the grouped OMP approach selects the group $\mathcal{G}$ which maximizes:

while the other OMP variant maximizes


There are simple counter-examples that show where both of these approaches fail. For the $L_\infty$ approach, the algorithm will pick groups that have one single good feature, while other groups may contain arbitrarily many good, but slightly worse, features. The $L_2$ approach fails in an opposite fashion. Given a group with many identical copies of the same feature, the $L_2$ criterion will be very high, when the group may in fact have very little benefit.

We propose two group-based variants of the FR and OMP algorithms, given in Algorithm 5.4 and Algorithm 5.5, to fix these problems. The FR variant of the algorithm is a very natural greedy strategy evaluated over the groups instead of the individual features. The OMP approach uses a method similar to the $L_2$ criterion proposed above, but modifies this by instead computing the quadratic form

$$(b_{\mathcal{G}}^{\mathcal{D}_{j-1}})^T (C_{\mathcal{G}} + \lambda I)^{-1} b_{\mathcal{G}}^{\mathcal{D}_{j-1}}.$$

An  alternative  way  to  view  this  OMP  variant  is  that  it  is  identical  to  the  L2  approach,  but  with 
the  added  caveat  that  the  data  is  whitened  within  each  group  prior  to  running  the  algorithm. 
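The effect of the within-group whitening can be checked numerically on the duplicated-feature counter-example. The following is a small sketch under stated assumptions: the correlation values, the choice $\lambda = 0.1$, equal group costs, and the helper names (`solve`, `l2_score`, `whitened_score`) are all illustrative and do not come from the experiments.

```python
from typing import List

Matrix = List[List[float]]


def solve(A: Matrix, b: List[float]) -> List[float]:
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z


def l2_score(b: List[float]) -> float:
    """Un-whitened criterion: squared L2 norm of the gradient over the group."""
    return sum(v * v for v in b)


def whitened_score(C: Matrix, b: List[float], lam: float) -> float:
    """Quadratic form b^T (C + lam I)^{-1} b for one group."""
    n = len(b)
    A = [[C[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    return sum(bi * zi for bi, zi in zip(b, solve(A, b)))


# Group "dup": five identical copies of a weak feature (correlation 0.3 each).
C_dup = [[1.0] * 5 for _ in range(5)]
b_dup = [0.3] * 5
# Group "good": two uncorrelated features with correlation 0.4 each.
C_good = [[1.0, 0.0], [0.0, 1.0]]
b_good = [0.4, 0.4]

l2_dup, l2_good = l2_score(b_dup), l2_score(b_good)
white_dup = whitened_score(C_dup, b_dup, lam=0.1)
white_good = whitened_score(C_good, b_good, lam=0.1)
print(l2_dup, l2_good, white_dup, white_good)
```

The un-whitened score prefers the group of duplicates, since it counts the same weak signal five times, while the quadratic form ranks the group with two genuinely distinct features higher.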

We  do  not  present  approximation  guarantees  here,  but  we  propose  that  similar  approximation 
guarantees  to  the  OMP  and  FR  algorithms  in  the  standard  sparse  approximation  setting  can  likely 
be  extended  to  this  setting  as  well. 


5.7  Experimental  Results 

We now present a number of results on practical applications. In our results we will compare the optimal feature set for a given budget (OPT), Forward Regression (FR), Orthogonal Matching Pursuit (OMP), and Lasso (L1). For FR and OMP, we use the cost-greedy variants presented here, as well as comparisons against the original implementations for uniform feature costs [Pati et al., 1993, Miller, 2002].

For the Lasso approach, we utilize two different approaches. For strictly regression settings we use Least Angle Regression [Efron et al., 2004] as implemented by the lars R package [Hastie and Efron, 2013]. For the multiclass experiments that will follow, we utilize the L1 regularization feature of Vowpal Wabbit [Langford, 2013] to generate the Lasso regularization path. Finally, in the Lasso approaches, we take the set of variables given by the L1 regularization path, and retrain the model without the L1 constraint to obtain the optimal performance for a given budget.

For  datasets,  we  use  the  UCI  Machine  Learning  Repository  [Frank  and  Asuncion,  2010], 
specifically  the  ‘housing’  and  ‘wine  quality’  datasets  for  regression  problems  and  the  ‘pendigits’ 
and  ‘letter’  datasets  for  multiclass  classification  problems. 

For all problems we normalize the features to unit variance and zero mean, and for regression problems we normalize the target as well. The ‘housing’ dataset uses $d = 13$ features, while the ‘wine quality’ dataset has $d = 11$ features. For the multiclass problems, ‘pendigits’ has $d = 16$ features and $k = 10$ classes, while ‘letter’ has $d = 16$ features and $k = 26$ classes.

Figure 5.2: Performance of optimal (OPT), Forward Regression (FR), Orthogonal Matching Pursuit (OMP) and Lasso (L1) algorithms on the UCI ‘housing’ dataset, for various parameterizations of the regularized (left) and constrained (right) sparse approximation problems.

Figure 5.2 shows the results for all algorithms on the ‘housing’ dataset using uniform costs, while varying the regularization and constraint parameters $\lambda$ and $\epsilon$ for the regularized and constrained sparse approximation problems. We use $\lambda \in \{0, 0.1, 0.25, 0.5\}$ and $\epsilon \in \{1, 0.5, 0.4, 0.3\}$ here. As predicted by our bounds, we see the greedy algorithms converge to optimal performance as the constraint or regularization increasingly penalizes larger weights. We also observe that the Lasso algorithm converges as the weight vector is increasingly penalized, but is consistently outperformed by the greedy approaches.

Figure 5.3 demonstrates the performance of all algorithms for the two regression datasets on the budgeted sparse approximation problem. Here we use synthetic costs, as no costs are provided with the datasets used. The costs are sampled from a gamma distribution with parameters $k = 2$, $\theta = 2.0$, to ensure that there is a mix of expensive and cheap costs. All results are averaged over 20 different sets of sampled costs.
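A cost sampling procedure of this kind might look like the sketch below. It assumes that the second gamma parameter is a scale parameter (so the mean cost is $2 \times 2.0 = 4$), and the function name `sample_feature_costs` and the seeding scheme are hypothetical choices made for the illustration.

```python
import random
from typing import List


def sample_feature_costs(num_features: int, shape: float = 2.0,
                         scale: float = 2.0, seed: int = 0) -> List[float]:
    """Draw one synthetic cost per feature from a gamma distribution."""
    rng = random.Random(seed)
    # random.gammavariate(alpha, beta) takes the shape alpha and the scale beta.
    return [rng.gammavariate(shape, scale) for _ in range(num_features)]


# 20 independent cost sets for a dataset with d = 13 features.
cost_sets = [sample_feature_costs(13, seed=s) for s in range(20)]
```

Each experiment would then be repeated once per cost set and the resulting curves averaged.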

To obtain a cost-based or budgeted version of the Lasso algorithm, we scale each feature $x$ by the inverse of the cost $c(x)$, effectively scaling the weight which is penalized in the $L_1$ term by $c(x)$, giving a cost-scaled version of the $L_1$ norm. This approach requires ensuring that the Lasso implementation used is not re-normalizing the features at any point, which we have done.
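The rescaling argument can be written out explicitly. The derivation below is a sketch that assumes a squared-loss Lasso objective; the regularization weight $\mu$ and the rescaled quantities $\tilde{x}$, $\tilde{w}_x$ are symbols introduced here only for illustration.

```latex
\begin{align*}
\text{Lasso on rescaled features } \tilde{x} = \frac{x}{c(x)}: \quad
  & \min_{\tilde{w}} \; E\Big[\big(Y - \sum_{x} \tilde{w}_x \tilde{x}\big)^2\Big]
    + \mu \sum_{x} |\tilde{w}_x| \\
\text{substituting } w_x = \frac{\tilde{w}_x}{c(x)}: \quad
  & \min_{w} \; E\Big[\big(Y - \sum_{x} w_x x\big)^2\Big]
    + \mu \sum_{x} c(x)\, |w_x|
\end{align*}
```

The predictions are unchanged because $\tilde{w}_x \tilde{x} = w_x x$, while each weight on the original features is now penalized in proportion to its cost.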

We see the same behavior here that we saw in the first comparison and in our theoretical bounds: Forward Regression is the closest to optimal performance, followed by Orthogonal Matching Pursuit, with the Lasso approach typically giving the worst performance.

For the smooth and simultaneous sparse approximation settings, Figure 5.4 demonstrates cost-greedy algorithms as well as their uniform cost counterparts on two multiclass datasets. Here we use the one-vs-all approach to multiclass classification, using the logistic loss for each of the $k$ resulting binary classification problems. This gives an overall sparse approximation problem with $k$ simultaneous smooth losses with strong smoothness parameter $M = 1$.

[Figure 5.3 panels: ‘housing’, ‘wine quality’]

Figure 5.3: Performance of optimal (OPT), Forward Regression (FR), Orthogonal Matching Pursuit (OMP) and Lasso (L1) algorithms on the UCI ‘housing’ and ‘wine quality’ datasets, for the budgeted sparse approximation problem using synthetic costs. Results are averaged over 20 sets of sampled synthetic costs.

For feature costs, we use the same synthetic cost sampling procedure described above, also averaged over 20 different sets of costs. In this problem we see the same behavior observed in Figure 5.3, with Forward Regression giving the best overall trade-off of cost and accuracy, followed by Orthogonal Matching Pursuit and then Lasso. We do not compare to optimal performance for these experiments because it is significantly more expensive to train a logistic regressor than a regular least squares fit for all subsets of the features.

Additionally, these results show that using the cost-greedy variant of each algorithm is critical to obtaining good performance on the budgeted problem. The variants intended for use in the uniform cost case significantly underperform in this setting. Although we do not know of any analysis for the budgeted setting, our results also indicate that a version of the Lasso algorithm which uses a weighted $L_1$ norm significantly outperforms the traditional unweighted approach when dealing with budgeted feature selection problems.

As another example of a budgeted feature selection problem, we use the Yahoo! Learning to Rank Challenge dataset augmented with feature costs [Xu et al., 2012]. Though this is a ranking problem, we use the regression version of the problem. Each document in the dataset is paired with a relevance ranking in $\{0, 1, 2, 3, 4\}$, and we use the normalized vector of relevances as the regression target $Y$. The dataset consists of 519 features, with costs drawn from $\{1, 5, 10, 20, 50, 100, 150, 200\}$. The full training dataset contains 473134 examples, but we use only the first 200000, as the Lasso implementation used in these experiments required the full dataset to be stored in memory, and our test machine did not have enough memory for the complete dataset.

Figure 5.5 gives the results of both cost-greedy and uniform variants of all algorithms on this dataset. Here we see the same basic behavior as all previous datasets, with a slightly less pronounced advantage to the cost-greedy variants over the uniform cost algorithms. Figure 5.6 gives a comparison of the training time required for all algorithms, as a function of the number of features selected. Note that in the regression case, all algorithms first compute the covariance matrix $C$ and vector $b$, and then operate on this reduced ($d = 519$ dimensions) space, so the training time should not scale with the number of training examples, except for the initial fixed cost (here 14.41s) of computing these quantities. Overall we observe that the OMP and L1 algorithms are fairly efficient, even for large numbers of features, while the FR algorithm is nearly two orders of magnitude more expensive.

Figure 5.4: Performance of cost-greedy and uniform cost variants of Forward Regression (FR), Orthogonal Matching Pursuit (OMP) and Lasso (L1) algorithms on the UCI ‘letter’ and ‘pendigits’ datasets, for the budgeted sparse approximation problem using synthetic costs. Results are averaged over 20 sets of sampled synthetic costs.

Finally, we present experimental results for the grouped feature setting. In this setting we contrast the proposed grouped FR (Algorithm 5.4) and grouped OMP (Algorithm 5.5) algorithms, the optimal group selections, the single feature versions of the FR and OMP algorithms (the $L_\infty$ approach described in Section 5.6) and the $L_2$ version of OMP, which is equivalent to the proposed algorithm with the added assumption that the data is already whitened within groups.

For  the  first  dataset  we  use  the  same  Yahoo  Learning  to  Rank  data  from  the  previous  results,  but 
with  randomly  generated  groups  (Figure  5.7).  We  randomly  distribute  the  d  =  519  features  evenly 
among  17  groups  of  approximately  30  features  each.  All  results  are  averaged  over  20  randomly 
sampled  sets  of  groups.  The  second  dataset  used  is  synthetically  sampled  data  (Figure  5.8).  We 
generate  d  =  160  features  for  100  examples  randomly,  such  that  the  features  have  moderately  high 
correlations  of  0.6,  in  a  manner  similar  to  Das  and  Kempe  [2011].  The  groups  are  also  randomly 
sampled with 16 groups of 10 features each. We then randomly sample half of the groups (8) and use uniform weights over the features in those groups to construct the target vector, along with added noise ($\sigma = 0.1$).

Figure 5.5: Performance of cost-greedy and uniform cost variants of Forward Regression (FR), Orthogonal Matching Pursuit (OMP) and Lasso (L1) algorithms on the budgeted version of the Yahoo! Learning to Rank dataset.


In these experiments we see that the grouped version of the FR algorithm clearly dominates and is closest to optimal, while the grouped version of the OMP algorithm is slightly worse. The $L_2$ variant, or non-whitened version of the grouped OMP approach, is substantially worse than all other algorithms. The $L_\infty$ or single feature versions are significantly sub-optimal for small numbers of selected groups, due to their inability to measure the overall benefit of the groups, but can sometimes (synthetic data) outperform the grouped approaches for larger numbers of selected groups.

Figure 5.6: A comparison of the training time required for various numbers of selected features for Forward Regression (FR), Orthogonal Matching Pursuit (OMP) and Lasso (L1) algorithms on the Yahoo! Learning to Rank dataset. Note the log-scale of training time.


[Figure 5.7 plot title: Yahoo! LTR Challenge (Grouped)]

Figure 5.7: Comparison of group selection approaches for the grouped feature selection problem with randomly sampled groups for the Yahoo! Learning to Rank data (left) along with a zoomed portion of the results (right). Algorithms compared are the grouped FR and OMP variants, single feature or $L_\infty$ versions, and the un-whitened or $L_2$ variant of the OMP approach.

Figure 5.8: Comparison of group selection approaches for the grouped feature selection problem with synthetic data. Algorithms compared are the grouped FR and OMP variants, single feature or $L_\infty$ versions, and the un-whitened or $L_2$ variant of the OMP approach.


Part III

Anytime Prediction

Chapter 6

SpeedBoost: Anytime Prediction Algorithms

In  this  chapter  we  will  now  combine  the  algorithms  and  analysis  from  the  previous  chapters  to 
tackle  the  anytime  prediction  problem  we  originally  set  out  to  study.  Specifically,  we  will  combine 
the  widely  applicable  framework  of  functional  gradient  methods  with  the  cost-greedy  algorithms 
and  theoretical  guarantees  from  our  analysis  of  sparse  approximation  methods,  to  obtain  methods 
for  building  ensemble  predictors  with  similar  near-optimal  guarantees. 


6.1  Background 

Recall  the  desirable  properties  for  anytime  algorithms  given  by  Zilberstein  [1996]: 

•  Interruptability:  a  prediction  can  be  generated  at  any  time. 

•  Monotonicity:  the  quality  of  a  prediction  is  non-decreasing  over  time. 

•  Diminishing  Returns:  prediction  quality  improves  fastest  at  early  stages. 

To  accomplish  these  goals  we  will  rely  heavily  on  the  two  major  areas  examined  in  the  earlier 
parts  of  this  work.  To  obtain  the  incremental,  interruptable  behavior  we  would  like  for  updating 
predictions  over  time  we  will  learn  an  additive  ensemble  of  weaker  predictors.  This  aspect  of 
our  anytime  approach  is  based  on  the  functional  gradient  framework  detailed  in  Chapter  2.  By 
learning  a  sequence  of  weak  predictors,  represented  as  a  linear  combination  of  functions  h,  we 
naturally  have  an  interruptable  predictor.  We  simply  evaluate  the  weak  predictors  in  sequence 
and  compute  the  linear  combination  of  the  outputs  whenever  a  prediction  is  desired,  allowing  for 
predictions  to  be  updated  over  time. 

Building on that foundation, we will then introduce the cost-greedy strategies studied in Chapters 4 and 5. This augmented cost-greedy version of functional gradient methods is simply an extension of the sparse approximation setting discussed in the previous chapter, with each weak predictor representing a variable to be selected. As we have shown previously, using a cost-greedy approach ensures that we select sequences of weak predictors that behave near-optimally and increase accuracy as efficiently as possible, satisfying the last two properties.

We will now develop algorithms similar to the Forward Regression (FR) and Orthogonal Matching Pursuit (OMP) algorithms discussed in Chapter 5 specifically for the functional gradient domain, which will generate sequences of weak predictors $h$ that obtain good anytime performance across a range of computational budgets. Specifically, we will show in Section 6.4 that the theoretical results for sparse approximation studied in Chapter 5 all generalize to certain variants of our anytime approach.


6.2  Anytime  Prediction  Framework 

Building from the functional gradient framework discussed in Chapter 2, we will base our framework for anytime prediction around the additive, incremental predictors discussed there. In the anytime setting, as in the functional gradient setting, we consider predictors $f : \mathcal{X} \to \mathcal{V}$ which compute some prediction $f(x) \in \mathcal{V}$ for inputs $x \in \mathcal{X}$.

To obtain the incremental improvement in performance we desire for our anytime learner, we use the additive predictors $f$ from functional gradient methods, which are a weighted combination of weaker predictors $h \in \mathcal{H}$:

$$f(x) = \sum_i \alpha_i h_i(x), \tag{6.1}$$

where $\alpha_i \in \mathbb{R}$ and $h_i : \mathcal{X} \to \mathcal{V}$.

We will use the same loss function optimization setting as functional gradient methods as well. Recall that we wish to minimize some objective functional $\mathcal{R}$:

$$\min_f \mathcal{R}[f].$$

Typically, $\mathcal{R}$ is some pointwise loss, evaluated over training data

$$\mathcal{R}[f] = \sum_{n=1}^{N} \ell(f(x_n)), \tag{6.2}$$

but as discussed in Chapter 2, a number of other objective functionals are possible.

We will model the time budget portion of the anytime setting as a budget constraint, in the same manner as the budgeted maximization problems discussed in Chapter 4 and Chapter 5. We assume that each weak predictor $h$ has an associated measure of complexity, or cost, $\tau(h)$ where $\tau : \mathcal{H} \to \mathbb{R}$. This measure of complexity allows for weak predictors which trade accuracy for computational efficiency and vice versa.

For the case where each predictor $h$ can have variable computational cost per example, such as a decision tree, we use the expected computation time. Let $\tau_x(h)$ be the cost of evaluating $h$ on example $x$. Then:

$$\tau(h) = E_X[\tau_x(h)].$$

We further assume that calculating the sum of the predictor outputs weighted by $\alpha_i$ takes negligible computation, so that the total computation time is dominated by the computation time of each predictor. Using this complexity measure we can describe the predictions generated at a given time $T$ as

$$f_{\langle T \rangle}(x) = \sum_{i=1}^{i^*} \alpha_i h_i(x), \qquad i^* = \max \left\{ i' \;\middle|\; \sum_{i=1}^{i'} \tau(h_i) < T \right\}.$$

In the anytime setting, we will then attempt to optimize the performance $\mathcal{R}[f_{\langle T \rangle}]$ for all budgets $T$.
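The truncated predictor $f_{\langle T \rangle}$ can be sketched as a simple loop over the learned sequence. This is an illustrative sketch only: the function name `anytime_predict`, the scalar inputs and outputs, and the use of the stored expected cost of each weak predictor (rather than measured wall-clock time) are assumptions made for the example.

```python
from typing import Callable, Sequence, Tuple

WeakPredictor = Callable[[float], float]


def anytime_predict(
    ensemble: Sequence[Tuple[float, WeakPredictor, float]],  # (alpha_i, h_i, tau(h_i))
    x: float,
    budget: float,
) -> float:
    """Sum the longest prefix of weak predictors whose total cost stays below the budget."""
    prediction, spent = 0.0, 0.0
    for alpha, h, tau in ensemble:
        if spent + tau >= budget:  # the prefix must satisfy sum of costs < T
            break
        prediction += alpha * h(x)
        spent += tau
    return prediction


# Hypothetical ensemble: a cheap constant, a linear term and an expensive quadratic term.
ensemble = [
    (1.0, lambda x: 1.0, 1.0),
    (0.5, lambda x: x, 2.0),
    (0.25, lambda x: x * x, 4.0),
]
for T in (0.5, 1.5, 3.5, 8.0):
    print(T, anytime_predict(ensemble, 2.0, T))
```

Larger budgets admit longer prefixes of the same sequence, so a single learned ensemble serves every budget $T$.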


6.3  SpeedBoost 

We now consider learning algorithms for generating anytime predictors. Formally, given a set of weak predictors $\mathcal{H}$ we want to find a sequence of weights and predictors such that the predictor $f$ constructed in Equation (6.1) achieves good performance $\mathcal{R}[f_{\langle T \rangle}]$ at all possible stopping times $T$.

Recall the properties of interruptability, monotonicity, and diminishing returns that are desirable in the anytime setting. An ensemble predictor as formulated in Section 6.2 naturally satisfies the interruptability property, by evaluating the weak predictors in sequence and stopping when necessary to output the final prediction.


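Algorithm 6.1 SpeedBoost

Given: starting point $f_0$, objective $\mathcal{R}$
for $i = 1, \ldots$ do
    Let $h_i, \alpha_i = \arg\max_{h \in \mathcal{H}, \alpha \in \mathbb{R}} \frac{\mathcal{R}[f_{i-1}] - \mathcal{R}[f_{i-1} + \alpha h]}{\tau(h)}$
    Let $f_i = f_{i-1} + \alpha_i h_i$.
end for
return Predictor $(\{(h_i, \alpha_i)\}_{i=1}^{\infty})$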


To  learn  predictors  which  satisfy  the  last  two  properties,  we  present  SpeedBoost  (Algorithm 
6.1),  a  natural  greedy  selection  approach  for  selecting  weak  predictors.  This  algorithm  uses  a  cost- 
greedy  selection  procedure  to  select  the  weak  learner  h  which  gives  the  largest  gain  in  objective  7 Z 
per  unit-cost: 


arg  max 
h&i,aes. 


r(h) 


(6.3) 
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It  can  be  shown  that  the  non-cost  based  version  of  this  optimization, 

argmax  \R[fi-i\  -  +  ah]] ,  (6.4) 

heH,ael R 

is the same optimization performed implicitly by many boosting algorithms. In many functional gradient methods, we select the predictor

$$
\arg\max_{h \in \mathcal{H}} \frac{\langle \nabla, h \rangle}{\|h\|},
$$

as per the discussion in Chapter 2. This optimization is equivalent to the optimization in Equation (6.4) for many losses. For example, in squared error regression, or in the exponential loss optimization of AdaBoost [Freund and Schapire, 1997], the two are equivalent.
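As a worked check of the squared error case, assume (notation introduced here only for illustration) that $\mathcal{R}[f] = \frac{1}{2}\|f - y\|^2$ under the functional inner product, so that $\nabla = f_{i-1} - y$. Expanding the objective along a candidate $h$ gives

$$
\mathcal{R}[f_{i-1} + \alpha h] = \mathcal{R}[f_{i-1}] + \alpha \langle \nabla, h \rangle + \frac{\alpha^2}{2} \|h\|^2 ,
$$

which is minimized at $\alpha^* = -\langle \nabla, h \rangle / \|h\|^2$, so that

$$
\max_{\alpha \in \mathbb{R}} \; \mathcal{R}[f_{i-1}] - \mathcal{R}[f_{i-1} + \alpha h] = \frac{\langle \nabla, h \rangle^2}{2 \|h\|^2} .
$$

Maximizing this gain over $h$ therefore selects the same predictor as maximizing $|\langle \nabla, h \rangle| / \|h\|$, which coincides with the signed criterion whenever $\mathcal{H}$ is closed under negation.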

As  we  will  discuss  later  in  Section  6.4,  this  algorithm  is  very  similar  to  the  Forward  Regression 
algorithm  discussed  in  the  previous  chapter  (Algorithm  5.1),  with  one  exception.  In  this  algorithm, 
the  weight  is  optimized  only  over  the  newly  added  element  hi,  while  in  Forward  Regression,  the 
weights  are  re-optimized  over  all  selected  variables. 

SpeedBoost will select a sequence of feature functions $h$ that greedily maximize the improvement in the algorithm's prediction per unit time. By using a large set $\mathcal{H}$ of different types of weak predictors with varying time complexity, this algorithm provides a simple way to trade computation time for improvement in prediction accuracy. Unfortunately, for many classes of functions where $\mathcal{H}$ is very large, Algorithm 6.1 can be impractical. Furthermore, unlike in regular boosting, where the projection operation can be implemented as an efficient learning algorithm, the cost-greedy selection criterion in Equation (6.3) is not so easily optimizable.

To  address  this  issue,  we  use  the  weak  learner  selection  methods  of  functional  gradient  descent 
and  other  boosting  methods.  As  shown  in  Chapter  2,  we  can  implement  an  efficient  gradient 
projection  operation  to  select  good  candidate  weak  predictors. 

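Recall from Section 2.2 that, given a function $\nabla$ representing the functional gradient (Section 2.1.2), the projection of $\nabla$ onto a set of weak predictors $\mathcal{H}$ is defined using the functional inner product,

$$
\mathrm{Proj}(\nabla, \mathcal{H}) = \arg\max_{h \in \mathcal{H}} \frac{\langle \nabla, h \rangle}{\|h\|} = \arg\max_{h \in \mathcal{H}} \frac{\sum_{n=1}^{N} \nabla(x_n) h(x_n)}{\sqrt{\sum_{n=1}^{N} h(x_n)^2}}. \tag{6.5}
$$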


For classifiers with outputs $h(x) \in \{-1, +1\}$, Equation (6.5) is simply a weighted classification problem. Equivalently, when $\mathcal{H}$ is closed under scalar multiplication, the projection rule can minimize the norm in function space,

$$
\mathrm{Proj}(\nabla, \mathcal{H}) = \arg\min_{h \in \mathcal{H}} \|\nabla - h\|^2 = \arg\min_{h \in \mathcal{H}} \sum_{n=1}^{N} \big(h(x_n) - \nabla(x_n)\big)^2, \tag{6.6}
$$

which corresponds directly to solving the least squares regression problem.
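The two forms of the projection can be compared numerically on a finite candidate set. The sketch below is illustrative only: it assumes the gradient and every candidate are given by their values at the $N$ training points, and the function names and toy data are hypothetical. The two rules agree whenever the best candidate has a positive inner product with the gradient, as in this example.

```python
import numpy as np


def project_inner_product(grad: np.ndarray, candidates: np.ndarray) -> int:
    """Index of the candidate maximizing <grad, h> / ||h||.

    grad:       shape (N,), gradient values at the training points.
    candidates: shape (K, N), row k holds h_k(x_1), ..., h_k(x_N).
    """
    inner = candidates @ grad                    # <grad, h_k>, up to a 1/N factor
    norms = np.linalg.norm(candidates, axis=1)   # ||h_k||, up to a 1/sqrt(N) factor
    return int(np.argmax(inner / norms))


def project_least_squares(grad: np.ndarray, candidates: np.ndarray) -> int:
    """Index of the candidate whose best rescaling c * h_k is closest to grad."""
    # The optimal scale for each candidate is c_k = <grad, h_k> / <h_k, h_k>.
    scales = (candidates @ grad) / np.sum(candidates ** 2, axis=1)
    residuals = grad[None, :] - scales[:, None] * candidates
    return int(np.argmin(np.sum(residuals ** 2, axis=1)))


toy_grad = np.array([1.0, -2.0, 0.5, 3.0])
toy_candidates = np.array([
    [1.0, -1.0, 1.0, 1.0],   # a {-1, +1} classifier
    [1.0, 1.0, 1.0, 1.0],    # a constant classifier
    [0.0, 0.0, 0.0, 2.0],    # a real-valued regressor
])
```

For the two $\{-1, +1\}$ rows the norm is constant, so the first rule reduces to comparing the weighted agreement $\sum_n \nabla(x_n) h(x_n)$, which is the weighted classification view.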


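Algorithm 6.2 SpeedBoost.Proj

Given: starting point $f_0$, objective $\mathcal{R}$
for $i = 1, \ldots$ do
    Compute gradient $\nabla_i = \nabla \mathcal{R}[f_{i-1}]$.
    Let $\mathcal{H}^* = \{ h_j^* \mid h_j^* = \mathrm{Proj}(\nabla_i, \mathcal{H}_j) \}$.
    Let $h_i, \alpha_i = \arg\max_{h \in \mathcal{H}^*,\, \alpha \in \mathbb{R}} \frac{\mathcal{R}[f_{i-1}] - \mathcal{R}[f_{i-1} + \alpha h]}{\tau(h)}$.
    Let $f_i = f_{i-1} + \alpha_i h_i$.
end for
return Predictor $(\{(h_i, \alpha_i)\}_{i=1}^{\infty})$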


Algorithm 6.2 gives a more tractable version of SpeedBoost for learning anytime predictors based on the projection strategy of functional gradient descent. Here we assume that there exist a relatively small number of weak learning algorithms, $\{\mathcal{H}_1, \mathcal{H}_2, \ldots\}$, representing classes of functions with similar complexity. For example, the classes may represent decision trees of varying depths or kernel-based learners of varying complexity. The algorithm first projects the functional gradient onto each individual class as in gradient boosting, and then uses the best result from each class to perform the greedy selection process described previously.
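The project-then-select loop can be sketched for squared loss, where both the projection and the line search have closed forms. Everything specific here is an assumption made for illustration: each class $\mathcal{H}_j$ is taken to be the linear functions of a fixed subset of input columns, its cost is a user-supplied number, and the names `speedboost_proj`, `X_demo`, and `demo_classes` are hypothetical.

```python
import numpy as np


def speedboost_proj(X, y, classes, n_iters=5, tol=1e-12):
    """Sketch of the project-then-select loop for R[f] = 0.5 * mean((f - y)^2).

    classes: list of (column_indices, cost) pairs; class H_j holds all linear
             functions of the listed columns and has cost tau = cost.
    Returns the selected (column_indices, weights, alpha, cost) tuples and the
    training loss before the first step and after each step.
    """
    f = np.zeros(len(y))
    losses = [0.5 * np.mean((f - y) ** 2)]
    ensemble = []
    for _ in range(n_iters):
        grad = f - y  # functional gradient at the training points (up to 1/N)
        best = None
        for cols, cost in classes:
            A = X[:, cols]
            # h_j* = Proj(grad, H_j): least-squares fit of the gradient.
            w, *_ = np.linalg.lstsq(A, grad, rcond=None)
            h = A @ w
            hh = h @ h
            if hh < tol:
                continue  # this class cannot change the prediction
            alpha = -(grad @ h) / hh  # exact line search for squared loss
            new_loss = 0.5 * np.mean((f + alpha * h - y) ** 2)
            gain = (losses[-1] - new_loss) / cost  # improvement per unit cost
            if best is None or gain > best[0]:
                best = (gain, cols, w, alpha, cost, h)
        if best is None:
            break  # no class gives any improvement
        _, cols, w, alpha, cost, h = best
        f = f + alpha * h
        ensemble.append((cols, w, alpha, cost))
        losses.append(0.5 * np.mean((f - y) ** 2))
    return ensemble, losses


# Toy problem: a cheap class sees column 0, an expensive class sees both columns.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 * X_demo[:, 0] + 0.1 * X_demo[:, 1]
demo_classes = [([0], 1.0), ([0, 1], 10.0)]
demo_ensemble, demo_losses = speedboost_proj(X_demo, y_demo, demo_classes)
```

On this toy problem the cheap class is picked first because it removes most of the loss at a tenth of the cost, and the expensive class is only used afterwards to fit what remains.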

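Algorithm 6.3 SpeedBoost.MP

Given: starting point $f_0$, objective $\mathcal{R}$
for $i = 1, \ldots$ do
    Compute gradient $\nabla_i = \nabla \mathcal{R}[f_{i-1}]$.
    Let $\mathcal{H}^* = \{ h_j^* \mid h_j^* = \mathrm{Proj}(\nabla_i, \mathcal{H}_j) \}$.
    Let $h_i = \arg\max_{h \in \mathcal{H}^*} \frac{\langle \nabla_i, h \rangle^2}{\tau(h)}$.
    Let $\alpha_i = \arg\min_{\alpha \in \mathbb{R}} \mathcal{R}[f_{i-1} + \alpha h_i]$.
    Let $f_i = f_{i-1} + \alpha_i h_i$.
end for
return Predictor $(\{(h_i, \alpha_i)\}_{i=1}^{\infty})$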


This modification to the greedy algorithm is closely related to another greedy feature selection algorithm. Algorithm 6.3 gives a complexity-weighted version of Matching Pursuit [Mallat and Zhang, 1993], adapted to function spaces. In this algorithm, we use the selection criterion

$$
\arg\max_{h \in \mathcal{H}^*} \frac{\langle \nabla_i, h \rangle^2}{\tau(h)}
$$

to select the next weak predictor at each iteration. This selection criterion is equivalent to the criterion used in Orthogonal Matching Pursuit [Pati et al., 1993], discussed in Chapter 5, but the algorithm only fits the weight $\alpha_i$, instead of refitting the weights on all of the weak learners. In the next section we will discuss the OMP equivalent for this algorithm, and the theoretical guarantees that can be derived for that algorithm (Algorithm 6.5).

In practice we use Algorithm 6.2 in favor of Algorithm 6.3 because the linesearch and loss evaluation required for its selection criterion are typically not significantly more expensive than the Matching Pursuit selection criterion. In the SpeedBoost.MP variant, we require a linesearch anyway after the next weak predictor is selected, so computing it for each complexity class is not a problem. Furthermore, the projection of the gradient onto each complexity class $\mathcal{H}_j$ is typically much more expensive than the linesearch required to optimize the selection criterion in SpeedBoost.Proj, so that portion of the optimization does not contribute significantly to the training time.

6.4  Theoretical  Guarantees 


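Algorithm 6.4 SpeedBoost.FR

Given: starting point $f_0$, objective $\mathcal{R}$
for $i = 1, \ldots$ do
    Let $h_i, \alpha_i = \arg\max_{h \in \mathcal{H},\, \alpha \in \mathbb{R}^i} \frac{\mathcal{R}[f_{i-1}] - \mathcal{R}\left[\sum_{j=1}^{i-1} \alpha_j h_j + \alpha_i h\right]}{\tau(h)}$.
    Let $f_i = \sum_{j=1}^{i} \alpha_{ij} h_j$.
end for
return Predictor $(\{(h_i, \alpha_i)\}_{i=1}^{\infty})$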


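Algorithm 6.5 SpeedBoost.OMP

Given: starting point $f_0$, objective $\mathcal{R}$
for $i = 1, \ldots$ do
    Compute gradient $\nabla_i = \nabla \mathcal{R}[f_{i-1}]$.
    Let $\mathcal{H}^* = \{ h_j^* \mid h_j^* = \mathrm{Proj}(\nabla_i, \mathcal{H}_j) \}$.
    Let $h_i = \arg\max_{h \in \mathcal{H}^*} \frac{\langle \nabla_i, h \rangle^2}{\tau(h)}$.
    Let $\alpha_i = \arg\min_{\alpha \in \mathbb{R}^i} \mathcal{R}\left[\sum_{j=1}^{i-1} \alpha_j h_j + \alpha_i h_i\right]$.
    Let $f_i = \sum_{j=1}^{i} \alpha_{ij} h_j$.
end for
return Predictor $(\{(h_i, \alpha_i)\}_{i=1}^{\infty})$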


We will now analyze a variant of the SpeedBoost algorithm and prove that the predictor produced by this algorithm is near optimal with respect to any sequence of weak predictors that could be computed in the same amount of time, for a common set of loss functions and certain classes of weak predictors.

Figure 6.1: Test set error as a function of complexity for the UCI ‘pendigits’ dataset, comparing SpeedBoost.MP (Algorithm 6.3) (black dashed line) to SpeedBoost.OMP (Algorithm 6.5) (solid red line).

For  analysis,  one  can  interpret  Algorithm  6.3,  and  to  some  degree  Algorithm  6.2,  as  a  time- 
based  version  of  Matching  Pursuit  [Mallat  and  Zhang,  1993].  Unfortunately,  the  sequence  of  weak 
predictors  selected  by  matching  pursuit  can  perform  poorly  with  respect  to  the  optimal  sequence 
for  some  fixed  time  budget  T  when  faced  with  highly  correlated  weak  predictors  [Pati  et  al.,  1993]. 
A  modification  of  the  Matching  Pursuit  algorithm  called  Orthogonal  Matching  Pursuit  [Pati  et  al., 
1993]  addresses  this  flaw. 

As discussed previously, the main difference between the MP and OMP approaches is the behavior of the weight fitting at each iteration. In SpeedBoost.MP we fit only the weight $\alpha_i$ on weak predictor $h_i$ at each iteration. In the OMP approach and in SpeedBoost.OMP, we refit all the weights $\alpha$ on each weak predictor at each iteration. Similarly, the basic SpeedBoost algorithm (Algorithm 6.1) differs from Forward Regression only in this same weight refitting aspect.

To that end, we present SpeedBoost.FR (Algorithm 6.4) and SpeedBoost.OMP (Algorithm 6.5), which are modifications of SpeedBoost (Algorithm 6.1) and SpeedBoost.MP (Algorithm 6.3), respectively. The key difference between these algorithms is the refitting of the weights on every weak predictor selected so far at every iteration of the algorithm.

The  key  disadvantage  of  using  this  algorithm  in  practice  is  that  the  output  of  all  previous  weak 
predictors  must  be  maintained  and  the  linear  combination  re-computed  whenever  a  final  prediction 
is  desired. 
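The effect of refitting can be seen on a small squared-loss example with two correlated weak predictors. This is a sketch under stated assumptions: the weak predictors are given by their outputs on three training points, the selection order is fixed in advance, and the helper names `mp_weights` and `omp_weights` are hypothetical.

```python
import numpy as np


def mp_weights(H, y):
    """Fit each weight once, in order, leaving earlier weights fixed (MP style)."""
    f = np.zeros_like(y, dtype=float)
    alphas = []
    for h in H:
        alpha = ((y - f) @ h) / (h @ h)  # one-dimensional line search, squared loss
        alphas.append(alpha)
        f = f + alpha * h
    return np.array(alphas)


def omp_weights(H, y):
    """Refit all weights jointly by least squares (OMP style)."""
    A = np.stack(H, axis=1)
    alphas, *_ = np.linalg.lstsq(A, y, rcond=None)
    return alphas


# Two correlated weak predictors; the target equals the second one exactly.
h1 = np.array([1.0, 1.0, 0.0])
h2 = np.array([1.0, 0.0, 0.0])
target = np.array([1.0, 0.0, 0.0])

A_demo = np.stack([h1, h2], axis=1)
sse_mp = float(np.sum((A_demo @ mp_weights([h1, h2], target) - target) ** 2))
sse_omp = float(np.sum((A_demo @ omp_weights([h1, h2], target) - target) ** 2))
```

The refit drives the weight on `h1` back to zero, which the single-weight update cannot do. It also needs the stored outputs of both predictors, which is the bookkeeping cost described above.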

In practice, we found that SpeedBoost and SpeedBoost.MP performed nearly as well as Algorithm 6.5 in terms of improvement in the objective function, while being significantly cheaper to implement. We did not test SpeedBoost.FR, because it is intractable in practice for the boosting setting. To properly implement it would require enumerating all possible weak predictors in a given set and running a full weight optimization over all previous weak learners for each one, which would be prohibitively expensive.

Figure 6.1 shows a comparison of the test error on the UCI ‘covertype’ dataset for SpeedBoost.MP and SpeedBoost.OMP. In this case, while the training objective performances were nearly indistinguishable (not shown), Algorithm 6.5 overfit to the training data much more rapidly.

6.4.1  Uniformly  Anytime  Near-Optimality 

Now  that  we  have  these  function  space  equivalents  of  the  FR  and  OMP  algorithms,  we  can  use  the 
results  in  Chapter  5  to  obtain  approximation  guarantees  on  them. 

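Assume that $\mathcal{R}$ is the pointwise loss given in Equation (6.2). Let $S$ be a set of selected weak predictors $S \subset \mathcal{H}$, and let $f_S$ be the linear combination of those selected weak predictors which minimizes $\mathcal{R}$:

$$
f_S = \sum_{i \in S} \alpha_i^* h_i, \qquad \alpha^* = \arg\min_{\alpha} \mathcal{R}\Big[\sum_{i \in S} \alpha_i h_i\Big]. \tag{6.7}
$$

Then, we can define a set function equivalent of minimizing the objective $\mathcal{R}$ as

$$
F_{\mathcal{R}}(S) = \mathbb{E}_x[\ell(0)] - \min_{\alpha \in \mathbb{R}^{|S|}} \mathbb{E}_x\Big[\ell\Big(\sum_{i \in S} \alpha_i h_i(x)\Big)\Big] = \mathcal{R}[0] - \mathcal{R}[f_S].
$$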

We have now reduced the anytime framework and SpeedBoost approach described above to the sparse approximation problem analyzed in Chapter 5. It can further be shown that SpeedBoost.FR (Algorithm 6.4) and SpeedBoost.OMP (Algorithm 6.5) are exactly equivalent to the Forward Regression and Orthogonal Matching Pursuit algorithms previously analyzed for the sparse approximation problem.
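For squared loss the set function of the reduction can be evaluated directly with a least-squares solve. The sketch below assumes $\ell$ is half the squared error against a target vector and that each weak predictor is stored as a column of outputs on the training points; the name `set_gain` and the toy data are hypothetical.

```python
import numpy as np


def set_gain(H, y, S):
    """F(S) = R[0] - R[f_S] for R[f] = 0.5 * mean((f - y)^2).

    H: shape (N, K), column k holds the outputs of weak predictor h_k.
    S: iterable of selected column indices.
    """
    loss_zero = 0.5 * np.mean(y ** 2)
    cols = sorted(set(S))
    if not cols:
        return 0.0
    A = H[:, cols]
    alpha, *_ = np.linalg.lstsq(A, y, rcond=None)  # optimal weights alpha*
    return float(loss_zero - 0.5 * np.mean((A @ alpha - y) ** 2))


H_demo = np.array([[1.0, 1.0],
                   [1.0, 0.0],
                   [0.0, 0.0]])
y_set_demo = np.array([1.0, 0.0, 0.0])
```

Since the weights are re-optimized for every set, adding a weak predictor can never decrease the value, which is the monotonicity that the sparse approximation analysis relies on.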

Using this reduction, we can apply all the results derived in Chapter 4 and Chapter 5 to our anytime prediction setting and boosting framework. For example, consider the regularized variant of the sparse approximation reduction given above:

$$
F_{\mathcal{R}}(S) = \mathbb{E}_x[\ell(0)] - \min_{\alpha \in \mathbb{R}^{|S|}} \Big\{ \mathbb{E}_x\Big[\ell\Big(\sum_{i \in S} \alpha_i h_i(x)\Big)\Big] + \frac{\lambda}{2} \alpha^T \alpha \Big\}. \tag{6.8}
$$

This is equivalent to using a modified optimization problem where the weights on the weak learners are regularized:

$$
\min_{f} \; \mathcal{R}[f] + \frac{\lambda}{2} \alpha^T \alpha.
$$

We  can  now  apply  the  bounds  for  the  regularized,  smooth  sparse  approximation  problem  to  this 
setting  and  get  a  guarantee  for  the  performance  of  our  SpeedBoost  variants. 

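Theorem 6.4.1 (Uniformly Anytime Approximation Guarantee). Let $F_{\mathcal{R}}$ be the regularized version of the anytime problem given in Equation (6.8), with regularization parameter $\lambda$. Assume that the weak predictors in $\mathcal{H}$ all have bounded norm $\|h\| \le 1$. Let loss $\ell$ be an $M$-strongly smooth functional. Let $S$ be any sequence of elements in $\mathcal{H}$. Let $\gamma = \frac{\lambda}{M + \lambda}$. Algorithm 6.4 selects a sequence of weak predictors $G = \{h_i \mid h_i \in \mathcal{H}\}$, such that for any time $T = \sum_{i=1}^{t} \tau(h_i)$,

$$
F_{\mathcal{R}}(G_{\langle T \rangle}) > \left(1 - e^{-\gamma}\right) F_{\mathcal{R}}(S_{\langle T \rangle}),
$$

and Algorithm 6.5 selects a sequence of weak predictors $G' = \{h'_i \mid h'_i \in \mathcal{H}\}$, such that for any time $T = \sum_{i=1}^{t} \tau(h'_i)$,

$$
F_{\mathcal{R}}(G'_{\langle T \rangle}) > \left(1 - e^{-\gamma^2}\right) F_{\mathcal{R}}(S_{\langle T \rangle}).
$$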


The proof is a direct application of the bounds in the previous chapters to the sparse approximation reduction we've detailed above.

Theorem 6.4.1 states that, for all times $T$ at which the weak learners selected by SpeedBoost.FR and SpeedBoost.OMP update their prediction, the resulting improvement in loss $\mathcal{R}$ is approximately as large as that of any other sequence of weak learners that could have been computed up to that point. This means that the anytime predictor generated by those algorithms is competitive even with sequences specifically targeting fixed time budgets, uniformly across all times at which the anytime predictor computes new predictions.

Other results from previous chapters could also be easily extended to this setting using the above sparse approximation reduction. For example, the doubling algorithm discussed in Section 4.4 for obtaining a bi-criteria approximation with respect to any arbitrary budget $T$ could be adapted to SpeedBoost and its variants to obtain the same bi-criteria approximation guarantees for arbitrary time budgets.

6.5 Experimental Results

6.5.1  Classification 

Our first application is a set of classification problems from the UCI Machine Learning Repository [Frank and Asuncion, 2010]. We use the multiclass extension [Mukherjee and Schapire, 2010] to the exponential loss

$$
\ell_n(f(x_n)) = \sum_{k \neq y_n} \exp\big(f(x_n)_k - f(x_n)_{y_n}\big).
$$
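For a single example, this loss sums an exponential penalty over every incorrect class, each measured relative to the score of the correct class. A minimal sketch, assuming the predictor outputs one real-valued score per class (the function name is hypothetical):

```python
import numpy as np


def multiclass_exp_loss(scores: np.ndarray, label: int) -> float:
    """Sum over k != label of exp(scores[k] - scores[label])."""
    diffs = scores - scores[label]
    wrong = np.arange(len(scores)) != label  # mask out the correct class
    return float(np.sum(np.exp(diffs[wrong])))
```

With all scores equal, each of the $K - 1$ incorrect classes contributes exactly 1, and raising the correct class score shrinks every term at once.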

For  weak  predictors  we  use  decision  trees  of  varying  depth  up  to  20  nodes  deep.  We  use 
Algorithm  6.2  and  the  weighted  classification  form  of  gradient  projection  to  select  the  sequence  of 
trees  for  our  anytime  prediction  algorithm. 


Figure 6.2: Test set error as a function of prediction time for the UCI ‘pendigits’ (top) and ‘covertype’ (bottom) datasets. The algorithms shown are SpeedBoost.Proj (black dashed line) and AdaBoost.MM [Mukherjee and Schapire, 2010] (red solid line).


As a point of comparison we use the AdaBoost.MM [Mukherjee and Schapire, 2010] implementation of multiclass boosting on the same set of trees. AdaBoost, when used in this manner to generate an anytime predictor, is effectively a variant of the greedy selection algorithm (Algorithm 6.1) which does not consider the computation time $\tau(h)$ of the individual hypotheses.

Figure 6.2 shows the performance of our algorithm and AdaBoost as a function of the average number of features accessed per example. On these problems, the predictor generated by SpeedBoost finds a reasonable prediction using fewer features than the AdaBoost alternative and remains competitive with AdaBoost as time progresses.

6.5.2 Object Detection

Our  second  application  is  a  vehicle  detection  problem  using  images  from  onboard  cameras  on  a 
vehicle  on  public  roads  and  highways  under  a  variety  of  weather  and  time-of-day  conditions.  The 
positive  class  includes  all  vehicle  types,  e.g.,  cars,  trucks,  and  vans.  Negative  examples  are  drawn 
from  non-vehicle  regions  of  images  taken  from  the  onboard  cameras. 

Prediction  on  Batch  Data 

In the previous application we considered weak predictors which run for roughly the same amount of time on each example $x$, and we cared about the performance of the learned predictor over time on a single example. In many settings, however, we care about the computational requirements of a predictor on a batch of examples as a whole. For example, in ranking we care about the computation time required to get an accurate ranking on a set of items, and in computer vision applications many examples from a video or image are often processed simultaneously. Another way to view this problem is as a simplified version of the structured prediction problem where the goal is to make predictions on all pixels in an image simultaneously.

In these settings, it is often beneficial to allocate more computational resources to the difficult examples than the easy examples in a batch, so extra resources are not wasted improving predictions on examples that the algorithm already has high confidence in. In computer vision, in particular, cascades [Viola and Jones, 2001] are a popular approach to improving batch prediction performance. These prediction algorithms decrease the overall complexity of a predictor by periodically filtering out and making final predictions on examples, removing them from later prediction stages in the algorithm.

We can use our anytime framework and algorithms to consider running each weak predictor on subsets of the data instead of every example. Given a set of weak predictors $\mathcal{H}$ to optimize over, we can create a new set of predictors $\mathcal{H}'$ by introducing a set of filter functions $\phi \in \Phi$:

$$
\phi : \mathcal{X} \to \{0, 1\},
$$

and considering the pairing of every filter function and weak predictor

$$
\mathcal{H}' = \Phi \times \mathcal{H}, \qquad h'(x) = \phi(x) h(x).
$$

These filters $\phi$ represent the decision to either run the weak predictor $h$ on example $x$ or not. Unlike cascades, these decisions are not permanent and apply only to the current stage. This property allows the anytime predictor to quickly focus on difficult examples and gradually revisit the lower margin examples, whereas the cascade predictor must be highly confident that an example is correct before halting prediction on that example.

Assuming that the filter function is relatively inexpensive to compute compared to the computation time of the predictor, the new complexity measure for predictors $h'$ is

$$
\tau(h') = \mathbb{E}_x\left[\phi(x) \tau_x(h)\right],
$$

or the expected computation time of the original predictor, only on image patches which are not filtered out by the filter function $\phi$.


Implementation 


Figure  6.3:  Test  set  error  for  the  vehicle  detection  problem  as  a  function  of  the  average  number  of  features 
evaluated  on  each  image  patch. 

Similar to previous work in object detection, we use Haar-like features computed over image patches for weak predictors. We search over margin-based filter functions $\phi$, such that the filters at stage $i$ are

$$
\phi_i(x) = \mathbb{1}\left(|f_{i-1}(x)| < \theta\right),
$$

leveraging the property that examples far away from the margin are (with high probability) already correctly classified.
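A margin-based filter and the resulting filtered update can be sketched as follows. This is an illustration rather than the detector used in the experiments: it assumes binary predictions stored as real-valued margins, a single threshold `theta`, and known per-example costs, and all function names are hypothetical.

```python
import numpy as np


def margin_filter(f_prev: np.ndarray, theta: float) -> np.ndarray:
    """phi(x) = 1 where the current margin |f(x)| is below theta, else 0."""
    return (np.abs(f_prev) < theta).astype(float)


def filtered_update(f_prev, h_vals, alpha, theta):
    """Apply f <- f + alpha * phi(x) * h(x); only low-margin examples change."""
    phi = margin_filter(f_prev, theta)
    return f_prev + alpha * phi * h_vals, phi


def filtered_cost(phi: np.ndarray, per_example_cost: np.ndarray) -> float:
    """Empirical version of E_x[phi(x) * tau_x(h)]."""
    return float(np.mean(phi * per_example_cost))


margins = np.array([-3.0, -0.2, 0.1, 2.5])
new_margins, phi_demo = filtered_update(margins, np.ones(4), alpha=0.5, theta=1.0)
stage_cost = filtered_cost(phi_demo, np.full(4, 2.0))
```

Here only the two low-margin examples are updated, so the stage is charged half of the unfiltered cost. Because the filter is recomputed from the current margins at every stage, an example skipped now can be selected again later.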

Computing  these  filters  at  test-time  can  be  made  relatively  efficient  in  two  ways.  First,  by 
storing  examples  in  a  priority  queue  sorted  by  current  margin,  the  updates  and  filtering  at  each 
stage  can  be  made  relatively  cheap.  Second,  after  learning  the  anytime  predictor  using  Algorithm 
6.2,  all  future  filters  are  known  at  each  stage,  and  so  the  predictor  can  quickly  determine  the  next 
stage  an  example  will  require  computation  in  and  handle  the  example  accordingly. 

We compare against a cascade implementation for this detection dataset which uses the standard AdaBoost algorithm for an inner loop learning algorithm. Figure 6.3 gives the error on a test dataset of 10000 positive and 50000 negative examples as a function of computation time. In this setting the cascade is at a significant disadvantage because it must solidly rule out any negative examples before classifying them as such, while the AdaBoost and anytime predictors can initially declare all examples negative and proceed to adjust prediction on positive examples.

Figure 6.4: Fraction of data updated by each iteration for the cascade (dashed blue line) and anytime predictor (solid red line).

To further illustrate the large benefit of being able to ignore examples early on and revisit them later, Figure 6.4 gives a per-iteration plot of the fraction of test data updated by each corresponding feature. This demonstrates the large culling of examples early on that allows the anytime predictor to improve performance much more rapidly. Finally, Figure 6.5 displays the ROC curve for the anytime predictor at various complexity thresholds against the ROC curve generated by the final cascade predictions, and Figure 6.6 shows the visual evolution of the cascade and anytime predictions on a single test image.


6.5.3  Budgeted  Feature  Selection 

Our  final  application  for  the  anytime  prediction  framework  is  in  the  feature  selection  domain.  In 
this  setting,  we  assume  that  the  examples  x  do  not  have  precomputed  features,  and  that  the  total 
prediction  time  T  is  dominated  by  the  computation  time  of  the  features  of  x  required  by  our 
predictor.  The  goal,  therefore,  is  to  select  a  sequence  of  weak  predictors  h  which  only  use  a  subset 
of  the  features  for  a  given  example  x.  At  any  given  point  we  want  to  have  selected  the  most  efficient 
subset  of  features  to  obtain  good  anytime  performance. 

This  is  the  same  budgeted  feature  selection  setting  as  Xu  et  al.  [2012]  and  Xu  et  al.  [2013a]. 
To  handle  this  setting,  we  will  have  to  slightly  augment  our  cost  model,  to  allow  the  cost  of  a  weak 
predictor  to  be  dependent  on  previously  selected  ones. 

Figure 6.5: ROC curves for the final cascade predictions and anytime algorithm predictions at various computation thresholds. Computation is measured using the average number of features computed on each patch.

Dependent  Weak  Learner  Costs 

In  the  anytime  prediction  framework  we  presented  in  Section  6.2  we  made  the  assumption  that 
weak predictors have some fixed cost $\tau(h)$. However, in the budgeted feature selection model,
features  only  incur  computation  costs  the  first  time  they  are  used.  This  implies  that  the  cost  of  a 
new  weak  predictor  is  dependent  on  which  features  have  been  selected  so  far,  and  hence  which 
predictors  have  been  selected  so  far. 

To handle this case, we now introduce a weak predictor cost which is conditioned on the previously selected weak predictors. When evaluating the cost of a weak predictor after selecting $t$ weak predictors already, the cost would be

$$
\tau(h \mid h_1, \ldots, h_t),
$$

and the total predictor cost would be

$$
\tau(f) = \sum_{t} \tau(h_t \mid h_1, \ldots, h_{t-1}).
$$

In the budgeted feature selection setting we are given a set of features $\phi \in \Phi$ and we assume each feature has some computation or acquisition cost $c_\phi$. We also assume that there is some small fixed cost for each predictor used, which represents the computational cost of evaluating that predictor, after all features have been computed. For example, this might be the cost of evaluating a decision tree once all features are known. We represent this fixed cost as $c_h$ for predictor $h$.

Using the notation $\phi \in f$ to represent that feature $\phi$ is used by predictor $f$, we can write the conditional cost for the budgeted feature selection problem as

$$
\tau(h \mid h_1, \ldots, h_t) = c_h + \sum_{\phi} c_\phi \, \mathbb{1}(\phi \in h) \prod_{j=1}^{t} \mathbb{1}(\phi \notin h_j).
$$

As desired, this definition of cost only incurs a penalty for the first time a feature is computed. After that, the use of a feature is free, except for any fixed costs incurred in computing the predictor, captured by the cost $c_h$.

The total computation time of a predictor $f$ then reduces exactly as we'd expect to

$$
\tau(f) = \sum_{t} c_{h_t} + \sum_{\phi} c_\phi \, \mathbb{1}(\phi \in f).
$$

To  modify  our  anytime  prediction  algorithms  for  this  setting  we  simply  augment  the  greedy 
selection  step  in  each  algorithm  to  use  the  conditional  cost  instead  of  a  fixed  cost.  This  conditional 
cost  requirement  breaks  the  assumptions  required  for  the  theoretical  guarantees  in  Section  6.4,  but 
in  practice  the  performance  appears  to  be  similar  to  the  results  we  see  when  examining  settings 
with  fixed  costs. 
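The conditional cost is straightforward to compute by tracking which features have already been paid for. In this sketch each weak predictor is represented only by the list of feature names it uses, every predictor shares one fixed cost, and the names and cost values are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def conditional_cost(features: Iterable[str], used: Set[str],
                     feature_cost: Dict[str, float], fixed_cost: float) -> float:
    """c_h plus the cost of every feature of h not used by an earlier predictor."""
    new_features = set(features) - used
    return fixed_cost + sum(feature_cost[p] for p in new_features)


def total_cost(predictors: List[List[str]], feature_cost: Dict[str, float],
               fixed_cost: float) -> float:
    """Sum the conditional costs along the sequence of selected predictors."""
    used: Set[str] = set()
    total = 0.0
    for feats in predictors:
        total += conditional_cost(feats, used, feature_cost, fixed_cost)
        used |= set(feats)
    return total


demo_costs = {"a": 5.0, "b": 20.0, "c": 1.0}
demo_sequence = [["a", "c"], ["a", "b"], ["c"]]
```

For this sequence the three predictors cost 7, 21 and 1, for a total of 29, which matches the closed form: three fixed costs of 1 plus each distinct feature cost counted once.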

Cost-Regularized Regression Trees

In  this  domain  we  want  to  generate  weak  predictors  h  that  incur  a  variety  of  costs,  by  using  different 
mixtures  of  already  computed  and  new  features.  Ideally,  the  weak  predictors  would  also  use  new 
features  with  a  variety  of  costs,  to  explore  the  possibilities  of  using  cheap  and  expensive  features. 

To achieve this, we use the weak predictor proposed by Xu et al. [2012] in their Greedy Miser algorithm. Their weak predictor is based on a regression tree framework, but modifies the regression tree split function, also known as the impurity function, with a cost-based regularizer. Assume we are at iteration $t + 1$, and have already learned a predictor $f_t$. The Greedy Miser weak predictor selects node splits using an impurity function $g$ which optimizes the cost-regularized squared error

$$
g(h) = \frac{1}{2} \sum_{n=1}^{N} \|\nabla(x_n) - h(x_n)\|^2 + \lambda \tau(h \mid f_t),
$$

where $\lambda$ is a regularization parameter which trades cost and accuracy of the learned predictor.
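The trade-off controlled by the regularization parameter can be illustrated by scoring two candidate predictors with the cost-regularized squared error. This sketch applies the criterion to whole candidates rather than to individual tree splits as Greedy Miser does, and the gradient values, candidate outputs, and costs are made up for the example.

```python
import numpy as np


def cost_regularized_impurity(grad: np.ndarray, h_vals: np.ndarray,
                              cost: float, lam: float) -> float:
    """0.5 * sum_n (grad(x_n) - h(x_n))^2 + lam * cost."""
    return 0.5 * float(np.sum((grad - h_vals) ** 2)) + lam * cost


grad_demo = np.array([1.0, 2.0, 3.0, 4.0])
cheap = (np.full(4, 2.5), 1.0)        # coarse fit using already-paid features
expensive = (grad_demo.copy(), 10.0)  # exact fit that needs a costly new feature


def pick(lam: float) -> str:
    """Return which candidate has the lower regularized impurity."""
    g_cheap = cost_regularized_impurity(grad_demo, cheap[0], cheap[1], lam)
    g_exp = cost_regularized_impurity(grad_demo, expensive[0], expensive[1], lam)
    return "cheap" if g_cheap < g_exp else "expensive"
```

With no regularization the exact but costly candidate wins; with a regularization weight of 1 the criterion prefers the coarse, cheap one. Different values of the parameter therefore yield candidates at different points of the cost and accuracy trade-off.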

In the Greedy Miser approach, a fixed $\lambda$ is chosen for all weak predictors, and functional gradient boosting proceeds as normal, using the cost-regularized weak learning algorithm. Using different values of $\lambda$ produces different points on the cost and accuracy trade-off spectrum. In our approach, we will use SpeedBoost to optimize simultaneously over weak predictors learned with all the different values of $\lambda$. Specifically, we will use SpeedBoost.Proj (Algorithm 6.2) to generate the best regression tree for each value of $\lambda$ in a pre-selected set of possible values, and then select from among these candidate trees using the cost-greedy SpeedBoost criterion.

Experiments

Our  first  problem  in  this  domain  is  the  Yahoo!  Learning  to  Rank  Challenge  data,  augmented  with 
feature  computation  costs  [Xu  et  al.,  2012].  The  dataset  consists  of  a  set  of  documents  paired  with 
relevance  scores  in  {0, 1,  2, 3, 4},  with  0  representing  a  completely  irrelevant  result  and  a  score  of  4 
indicating  high  relevance.  The  document,  relevance  pairs  are  then  grouped  by  query,  representing 
the  groups  within  which  the  results  are  to  be  ranked.  The  dataset  consists  of  473134  training 
documents,  along  with  71083  and  165660  validation  and  testing  documents. 

The features for the data are drawn from a variety of sources, with costs drawn from the set $c_\phi \in \{1, 5, 10, 20, 50, 100, 150, 200\}$. We additionally use a fixed cost $c_h = 1$ for each tree in this problem.

For  learning  purposes  we  model  the  problem  as  a  regression  problem  using  squared  error  with 
the  relevance  scores  for  targets.  In  practice,  the  Normalized  Discounted  Cumulative  Gain  (NDCG) 
[Jarvelin  and  Kekalainen,  2002]  is  used  to  measure  actual  ranking  performance,  but  we  cannot 
optimize  this  metric  directly. 
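For reference, NDCG at a cutoff $k$ can be computed as below. Several NDCG variants exist; this sketch assumes the common form with gain $2^{\mathrm{rel}} - 1$ and a base-2 logarithmic discount, which may differ in detail from the evaluation script used for the experiments. The function names are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[int], k: int) -> float:
    """Discounted cumulative gain of the first k items, in ranked order."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[int], k: int) -> float:
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

The `relevances` argument lists the true relevance scores in the order induced by the predicted scores, so a regression model trained on squared error is evaluated only through the ranking it produces.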

We compare to the Greedy Miser approach [Xu et al., 2012] for a variety of values of $\lambda \in \{0, 0.5, 1, 2, 4, 10\}$. We use the same set of $\lambda$ values and the regularized regression tree training detailed above to generate candidate weak predictors for the SpeedBoost.Proj algorithm. Additionally, for $\lambda = 0$, Greedy Miser is equivalent to the standard functional gradient approach. Although Greedy Miser is not an anytime approach per se, we can treat the predictor produced for a given fixed value of $\lambda$ as an anytime predictor in the same way we treat the sequences generated by SpeedBoost.

Figure 6.7 gives the training performance, in the form of mean squared error, and test set accuracy, in the form of NDCG@5, for the Yahoo! Learning to Rank data. The anytime performance of each individual Greedy Miser sequence is plotted, along with the single sequence trained by SpeedBoost.

Looking at the training performance, we see that the SpeedBoost approach is very close to the optimal performance at any given computational budget with respect to the training objective of mean squared error. For test set performance, the NDCG@5 on test data is also nearly optimal, but here we observe a small increase in the overfitting of the cost-greedy SpeedBoost approach as compared with the Greedy Miser approach. We postulate that this is due to the cost-greedy algorithm's tendency to re-use features in order to get smaller gains, due to their very low cost as compared to computing new features. This repeated re-use of cheap features can lead to increased overfitting, particularly for the regression trees used here.

The second application we consider is the Scene-15 scene categorization dataset. This dataset consists of 4485 images grouped into 15 classes describing the contents of the scene. Example classes include: highway, office, forest and coast. Since this is a multiclass classification task, we use the softmax, or cross-entropy loss, which is the multiclass generalization of the logistic loss.

We  utilize  the  same  features  and  training  procedure  as  Xu  et  al.  [2012].  A  total  of  1500  images 
are  sampled  (100  from  each  class)  and  used  as  training  data,  300  are  used  as  validation  and  the 
remaining  2685  are  used  as  test  data.  For  each  image,  184  different  feature  descriptors  are  com- 
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puted  using  a  variety  of  methods,  such  as  Local  Binary  Patterns  and  spatial  HOG  features.  Each 
separate  descriptor  is  then  used  to  train  15  one-vs-all  SVMs  on  30%  of  the  training  data.  Finally, 
the  predictions  of  the  trained  SVMs  are  used  as  input  features  for  the  anytime  learner,  with  the 
remaining  70%  of  the  training  data  being  used  for  training  the  anytime  predictor.  The  end  result 
is  a  set  of  1050  training  examples  and  184  x  15  =  2070  input  features  for  the  functional  gradient 
training  procedure. 

The  feature  costs  in  this  case  are  derived  from  the  cost  of  evaluating  the  underlying  feature 
descriptor.  Each  feature  corresponds  to  a  specific  one-vs-all  SVM  computed  on  one  particular 
feature  descriptor  with  a  particular  computational  cost,  represented  as  the  time  to  compute  that 
descriptor  for  the  average  image.  In  this  particular  case,  because  multiple  SVMs  are  trained  on 
each feature descriptor, we actually have costs for groups of features. That is, a weak predictor $h$
only incurs cost for feature $\phi$ if no feature $\phi'$ in the same group as $\phi$ has been computed
yet. Computing a feature in a group makes all the other features in that group have an effective cost
of 0, because the computational cost is fixed for the entire descriptor and all features derived from it.
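The group cost rule described above can be sketched in a few lines. This is only an illustration: the feature names, group names, and cost values are hypothetical, and the function `incremental_cost` is not part of the original implementation.

```python
def incremental_cost(feature, group_of, group_cost, computed_groups):
    """Cost of using `feature`, given the descriptor groups already computed.

    A feature is free once any feature from its group has been computed,
    because the whole descriptor is then already available.
    """
    group = group_of[feature]
    return 0.0 if group in computed_groups else group_cost[group]


# Hypothetical toy setup: two descriptors, each feeding several SVM outputs.
group_of = {"lbp_svm_0": "lbp", "lbp_svm_1": "lbp", "hog_svm_0": "hog"}
group_cost = {"lbp": 3.0, "hog": 5.0}

computed = set()
costs = []
for feature in ["lbp_svm_0", "lbp_svm_1", "hog_svm_0", "lbp_svm_0"]:
    costs.append(incremental_cost(feature, group_of, group_cost, computed))
    computed.add(group_of[feature])  # the descriptor is now cached
```

Only the first feature drawn from each descriptor pays the descriptor's cost; later features from the same group are free.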

In  this  setting  there  are  significantly  more  features  than  training  examples  (2070  vs.  1050).  This 
makes  it  highly  likely  that  the  regression  trees  used  as  weak  learners  will  overfit  to  the  training  data, 
even  when  using  a  small  subset  of  the  features.  In  practice  we  observe  that,  when  using  the  standard 
SpeedBoost  approach,  the  algorithm  significantly  overestimates  the  cost-greedy  gain  on  training 
data.  To  increase  robustness  in  this  setting,  we  use  a  sampling  approach  similar  to  Stochastic 
Gradient  Boosting  [Friedman,  1999].  At  each  iteration,  we  sample  90%  of  the  training  data  without 
replacement  and  use  this  training  data  for  gradient  projection,  i.e.  training  weak  predictors.  We 
then  evaluate  the  cost-greedy  gain  used  to  select  the  optimal  weak  predictor  in  SpeedBoost  on 
the  remaining  10%  of  the  data  that  was  held  out.  By  evaluating  the  cost-greedy  gain  on  held  out 
data,  we  compute  an  estimate  of  the  true  cost-greedy  gain  that  is  much  closer  to  the  behavior  on 
test  data. 
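A minimal sketch of this held-out estimate is given below, assuming a squared-error objective, a fixed step size of 1, and a trivial constant weak learner. The names `split_indices`, `fit_constant`, and `heldout_gain_per_cost` are hypothetical and chosen only for illustration.

```python
import random


def split_indices(n, train_fraction, rng):
    """Shuffle 0..n-1 and split without replacement."""
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(round(train_fraction * n))
    return idx[:cut], idx[cut:]


def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def fit_constant(features, residuals):
    """Hypothetical weak learner: always predicts the mean residual."""
    c = sum(residuals) / len(residuals)
    return lambda feature: c


def heldout_gain_per_cost(fit, cost, features, preds, targets, rng):
    train, held = split_indices(len(targets), 0.9, rng)
    # Gradient projection: fit the weak predictor to residuals on the 90% sample.
    h = fit([features[i] for i in train],
            [targets[i] - preds[i] for i in train])
    # Cost-greedy gain: improvement per unit cost on the 10% that was held out.
    held_targets = [targets[i] for i in held]
    before = mse([preds[i] for i in held], held_targets)
    after = mse([preds[i] + h(features[i]) for i in held], held_targets)
    return (before - after) / cost


n = 20
features = [float(i) for i in range(n)]
preds = [0.0] * n
targets = [1.0] * n
gain = heldout_gain_per_cost(fit_constant, 2.0, features, preds, targets,
                             random.Random(0))
```

Because the gain is measured on examples the weak predictor never saw, a predictor that merely memorizes the training sample receives little credit.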

Figure 6.8 gives the same comparison of training objective and test accuracy for different
computational budgets on the Scene-15 dataset. For this dataset we see behavior similar to that of
the Yahoo! Learning to Rank application. Though there are certain budgets for which the best fixed
Greedy Miser predictor outperforms the SpeedBoost predictor, overall the SpeedBoost approach is
nearly as good as the best performing predictor for a given budget. Furthermore, in this setting the
fixed predictors show very little change in features selected over time and largely target a single set
of features and hence a single budget. The SpeedBoost predictor, in contrast, initially uses cheaper
features and then switches to expensive features when doing so maximally increases the gain. Also
note that, while the training objective can continue to be decreased using only cheap features, as the
Greedy Miser predictor for $\lambda = 4$ shows, the sampling strategy ensures that the predictor
switches to using expensive features when doing so is beneficial on validation data.

Figure 6.6: Images displaying the performance on a test image for the anytime predictor produced by
SpeedBoost  (top)  and  the  cascade  (bottom).  Displayed  on  the  top  of  each  image  are  the  activations,  or 
classification  probabilities,  for  that  algorithm.  In  the  middle  is  a  heat  map  of  the  number  of  features  evaluated 
by  that  predictor  for  each  pixel,  along  with  a  3D  visualization  of  this  same  statistic  along  the  bottom.  Images 
are  arranged  left  to  right  through  time,  at  intervals  of  7  average  feature  evaluations  per  pixel.  Note  that,  at 
this  scale,  the  cascade  still  has  most  of  its  effort  spread  over  the  entire  image,  and  so  the  heatmap  and  3D 
visualization  are  largely  flat. 




Figure 6.7: Training objective (left) and test set accuracy (right) vs. computational cost for the budgeted
version of the Yahoo! Learning to Rank Challenge problem. Provided for comparison are the SpeedBoost.Proj
algorithm along with Greedy Miser [Xu et al., 2012] for a variety of regularization parameters $\lambda$.
For $\lambda = 0$, the Greedy Miser approach is equivalent to standard functional gradient boosting.


Figure 6.8: Training objective (left) and test set accuracy (right) vs. computational cost for the
Scene-15 scene categorization dataset. Provided for comparison are the SpeedBoost.Proj algorithm along
with Greedy Miser [Xu et al., 2012] for a variety of regularization parameters $\lambda$. For
$\lambda = 0$, the Greedy Miser approach is equivalent to standard functional gradient boosting.

Chapter 7

StructuredSpeedBoost:  Anytime 
Structured  Prediction 


In this chapter we will demonstrate another application of our anytime prediction framework to
the structured prediction setting, specifically to the scene understanding domain. To do so, we
will combine the cost-greedy SpeedBoost approach detailed in the previous chapter with the
structured prediction extensions of functional gradient methods which we detailed in Chapter 3.

7.1  Background 

We  will  first  briefly  review  the  structured  functional  gradient  previously  detailed  in  Chapter  3, 
specifically  Section  3.1.  For  more  details  on  the  structured  prediction  approach,  we  refer  the  reader 
to  that  chapter. 

Recall the structured prediction setting previously discussed in Section 3.1. In this setting we
are given some inputs $x \in \mathcal{X}$ and associated structured outputs $y \in \mathcal{Y}$. The
goal is to learn a function $f : \mathcal{X} \to \mathcal{Y}$ that minimizes some risk
$\mathcal{R}[f]$, typically evaluated pointwise over the inputs:

$$\mathcal{R}[f] = \mathbb{E}_x[\ell(f(x))]. \tag{7.1}$$

We also assume the structured outputs are representable as a variable length vector
$(y_1, \ldots, y_J)$, where $y_j$ represents the output for some structural element of the total output
$y$. For example, the structural elements may be the probability distribution over class labels for a
pixel in an image or the current prediction for a node in a graphical model.

We also assume that each output $j$ has associated with it some set $N(j)$ which represents the
locally connected elements of the structure of the problem, such as the locally connected factors
of a node $j$ in a typical graphical model. For a given node $j$, the predictions over the neighboring
nodes $y_{N(j)}$ and other features of the local structure $N(j)$ can then be used to update the
prediction for the node $j$, in a manner similar to the message passing approach commonly used for
graphical model prediction [Pearl, 1988].

As we did before, to approach such a structured prediction problem we will use an additive,
functional gradient version of the iterative decoding approach [Cohen and Carvalho, 2005,
Daume III et al., 2009, Tu and Bai, 2010, Socher et al., 2011, Ross et al., 2011]. In this approach,
we learn an additive structured predictor,

$$h(x, y^t)_j = \mathbb{1}\left(j \in h_S(x, y^t)\right) h_P\left(x_j, y^t_{N(j)}\right), \tag{7.2}$$

that consists of two main components: a selection function $h_S$, which selects some subset
of the structural elements to update at each iteration, and a predictor $h_P$, which runs on the
selected elements and updates the respective pieces of the structured output.
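As an illustration of how a weak predictor of this form is evaluated, the following sketch applies a selector and a predictor to a small chain-structured output. The scalar outputs, the selector `select_uncertain`, and the predictor `toy_predictor` are hypothetical choices made for the example and do not come from the original text.

```python
def apply_weak_predictor(x, y, neighbors, h_S, h_P):
    """Return the update h(x, y): h_P on selected elements, zero elsewhere."""
    selected = h_S(x, y)
    update = []
    for j in range(len(y)):
        if j in selected:
            y_nbrs = [y[i] for i in neighbors[j]]
            update.append(h_P(x[j], y_nbrs))
        else:
            update.append(0.0)  # unselected elements keep their prediction
    return update


def select_uncertain(x, y):
    # Hypothetical selector: elements whose current score is near zero.
    return {j for j, yj in enumerate(y) if abs(yj) < 0.5}


def toy_predictor(xj, y_nbrs):
    # Hypothetical predictor mixing the raw feature and neighbor context.
    return 0.5 * xj + 0.25 * sum(y_nbrs) / len(y_nbrs)


x = [1.0, -1.0, 2.0, 0.0]
y = [0.0, 0.8, 0.1, -0.9]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}  # a 4-node chain
update = apply_weak_predictor(x, y, neighbors, select_uncertain, toy_predictor)
```

Only elements 0 and 2 are selected here, so only those two entries of the update are non-zero.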

To complete the structured prediction extension, we need a method for selecting weak
predictors $h$ as specified in Equation (7.2). Following the functional gradient approach detailed in
Chapter 2, and extended to the structured prediction setting as in Section 3.1, we will use projected
functional gradients for this purpose.

There is typically no efficient way to train a selection function and predictor simultaneously,
so we will instead choose (selector, predictor) pairs by first enumerating selection functions $h_S$
and then using functional gradient methods to select the optimal $h_P$ for the chosen selector.

We can compute a functional gradient with respect to each element of the structured output,

$$\nabla(x)_j = \frac{\partial \ell(f(x))}{\partial f(x)_j}.$$

Given a fixed selection function $h_S$ and current predictions $y$, the functional gradient projection
for finding the optimal weak predictor $h_P$ is as follows. In order to minimize the projection error in
Equation (2.12) for a predictor $h$ of the form in Equation (7.2), we only need to find the prediction
function $h_P$ that minimizes

$$h_P^* = \arg\min_{h_P \in \mathcal{H}_P} \mathbb{E}_x\left[\sum_{j \in h_S(x,y)} \left\|\nabla(x)_j - h_P(x_j, y_{N(j)})\right\|^2\right]. \tag{7.3}$$


This optimization problem is equivalent to minimizing weighted least squares error over the
dataset

$$D = \bigcup_x \bigcup_{j \in h_S(x,y)} \left\{(\psi_j, \nabla(x)_j)\right\} = \mathrm{gradient}(f, h_S), \tag{7.4}$$

where $\psi_j = \psi(x_j, y_{N(j)})$ is a feature descriptor for the given structural node, and
$\nabla(x)_j$ is its target. In order to model contextual information, $\psi_j$ is drawn from both the
raw features $x_j$ for the given element and the previous locally neighboring predictions $y_{N(j)}$.
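The construction of this dataset can be sketched as follows. The sketch assumes scalar outputs on a chain, a squared loss (so the target is simply the residual), and hypothetical helper names (`gradient_dataset`, `chain_psi`, `select_uncertain`, `squared_loss_grad`) that are not taken from the original implementation.

```python
def gradient_dataset(examples, h_S, psi, grad):
    """Collect (feature descriptor, target) pairs over all selected elements."""
    dataset = []
    for x, y, target in examples:        # y holds the current predictions f(x)
        g = grad(y, target)
        for j in sorted(h_S(x, y)):
            dataset.append((psi(x, y, j), g[j]))
    return dataset


def select_uncertain(x, y):
    # Hypothetical selector: elements whose current score is near zero.
    return {j for j, yj in enumerate(y) if abs(yj) < 0.5}


def chain_psi(x, y, j):
    # Descriptor built from the raw feature and the mean neighbor prediction.
    nbrs = [y[i] for i in (j - 1, j + 1) if 0 <= i < len(y)]
    return (x[j], sum(nbrs) / len(nbrs))


def squared_loss_grad(y, target):
    # Assumption: squared loss, so the regression target is the residual.
    return [t - yj for yj, t in zip(y, target)]


examples = [
    ([1.0, 2.0, 3.0], [0.0, 0.9, 0.2], [1.0, 1.0, -1.0]),
    ([0.5, 0.5], [0.7, 0.1], [1.0, 0.0]),
]
D = gradient_dataset(examples, select_uncertain, chain_psi, squared_loss_grad)
```

Any regression learner can then be trained on `D` to produce a candidate $h_P$ for the chosen selector.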

The  functional  gradient  algorithm  for  learning  these  additive  structured  predictors  was  given 
previously  in  Algorithm  3.1. 

7.2 Anytime Structured Prediction

We now combine the structured functional gradient methods developed in Chapter 3 with the
anytime prediction techniques developed in Chapter 6.

Recall that in the anytime setting we have a cost $c(h)$ for each weak predictor $h$, and that in the
SpeedBoost approach (Section 6.3) we use a cost-greedy criterion for selecting predictors $h$:

$$h_t, \alpha_t = \arg\max_{h, \alpha} \frac{\mathcal{R}[f_{t-1}] - \mathcal{R}[f_{t-1} + \alpha h]}{c(h)}. \tag{7.5}$$

The adapted cost model for the additive weak predictor (Equation (7.2)) is then simply the sum
of the cost of evaluating both the selection function and the prediction function,

$$c(h) = c(h_S) + c(h_P). \tag{7.6}$$

Algorithm 7.1 summarizes the StructuredSpeedBoost algorithm for anytime structured
prediction. It is based on the structured functional gradient algorithm (Algorithm 3.1), modified
with the cost-greedy SpeedBoost criterion from Chapter 6 to select the most cost-efficient pair of
selection and prediction functions.

It enumerates the candidate selection functions $h_S$, creates the training dataset defined by
Equation (7.4), and then generates a candidate prediction function $h_P$ using each weak learning
algorithm. For all the pairs of candidates, it uses Equation (7.5) for picking the best pair, instead
of the non-anytime version, which simply optimizes the regular functional gradient criterion.


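Algorithm 7.1 StructuredSpeedBoost

Given: objective $\mathcal{R}$, set of selection functions $\mathcal{H}_S$, set of $L$ learning
algorithms $\{\mathcal{A}_l\}_{l=1}^{L}$, number of iterations $T$, initial function $f_0$.
for $t = 1, \ldots, T$ do
    $\mathcal{H}^* = \emptyset$
    for $h_S \in \mathcal{H}_S$ do
        Create dataset $D = \mathrm{gradient}(f_{t-1}, h_S)$ using Equation (3.9).
        for $\mathcal{A} \in \{\mathcal{A}_1, \ldots, \mathcal{A}_L\}$ do
            Train $h_P = \mathcal{A}(D)$
            Define $h$ from $h_S$ and $h_P$ using Equation (3.6).
            $\mathcal{H}^* = \mathcal{H}^* \cup \{h\}$
        end for
    end for
    $h_t, \alpha_t = \arg\max_{h \in \mathcal{H}^*, \alpha} \frac{\mathcal{R}[f_{t-1}] - \mathcal{R}[f_{t-1} + \alpha h]}{c(h)}$
    $f_t = f_{t-1} + \alpha_t h_t$
end for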
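The loop in Algorithm 7.1 can be sketched on a toy problem as follows. This is a simplified illustration under stated assumptions, not the original implementation: outputs are scalars, the loss is squared error (so the step size has a closed form), neighbor context is omitted, and the selectors, learners, and costs are hypothetical.

```python
def make_h(h_S, h_P):
    # Weak predictor in the form of Equation (7.2), without neighbor context.
    def h(x, y):
        sel = h_S(x, y)
        return [h_P(x[j]) if j in sel else 0.0 for j in range(len(x))]
    return h


def risk(preds, targets):
    # Squared error summed over elements, averaged over examples.
    return sum(0.5 * sum((p - t) ** 2 for p, t in zip(ps, ts))
               for ps, ts in zip(preds, targets)) / len(targets)


def fit_constant(D):
    c = sum(g for _, g in D) / len(D)
    return lambda xj: c


def fit_linear(D):
    den = sum(xj * xj for xj, _ in D)
    w = sum(xj * g for xj, g in D) / den if den > 0 else 0.0
    return lambda xj: w * xj


def structured_speedboost(xs, targets, selectors, learners, T):
    preds = [[0.0] * len(x) for x in xs]              # f_0 = 0
    ensemble = []
    for _ in range(T):
        base, best = risk(preds, targets), None
        for h_S, c_S in selectors:
            # Dataset of (feature, negative gradient) pairs on selected elements.
            D = [(x[j], t[j] - y[j]) for x, y, t in zip(xs, preds, targets)
                 for j in h_S(x, y)]
            if not D:
                continue
            for fit, c_P in learners:
                h = make_h(h_S, fit(D))
                U = [h(x, y) for x, y in zip(xs, preds)]
                den = sum(u * u for us in U for u in us)
                if den == 0:
                    continue
                num = sum(u * (t - p) for us, ps, ts in zip(U, preds, targets)
                          for u, p, t in zip(us, ps, ts))
                alpha = num / den                     # exact line search
                new = [[p + alpha * u for p, u in zip(ps, us)]
                       for ps, us in zip(preds, U)]
                score = (base - risk(new, targets)) / (c_S + c_P)
                if best is None or score > best[0]:
                    best = (score, alpha, h, new)
        if best is None or best[0] <= 0:
            break
        _, alpha, h, preds = best
        ensemble.append((alpha, h))
    return ensemble, preds


select_all = lambda x, y: set(range(len(x)))
select_small = lambda x, y: {j for j, yj in enumerate(y) if abs(yj) < 1.0}
xs = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
targets = [[2 * v + 1 for v in x] for x in xs]
initial_risk = risk([[0.0] * len(x) for x in xs], targets)
ensemble, final_preds = structured_speedboost(
    xs, targets, [(select_all, 1.0), (select_small, 0.5)],
    [(fit_constant, 1.0), (fit_linear, 2.0)], T=5)
final_risk = risk(final_preds, targets)
```

Each iteration scores every (selector, learner) pair by risk reduction per unit cost, so a cheap pair with a modest gain can be preferred over an expensive pair with a larger one.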


Figure 7.1: Hierarchical Inference Machines [Munoz et al., 2010]. (a) Input image. (b) The image is
segmented multiple times; predictions are made and passed between levels. Images courtesy of the authors'
ECCV 2010 presentation.

7.3  Anytime  Scene  Understanding 

7.3.1  Background 

In addition to part-of-speech tagging in natural language processing, scene understanding in
computer vision is another important and challenging structured prediction problem. The de facto
approach to this problem is with random field based models [Kumar and Hebert, 2006, Gould
et al., 2008, Ladicky et al., 2010], where the random variables in the graph represent the object
category for a region/patch in the image. While random fields provide a clean interface between
modeling and inference, recent works [Tu and Bai, 2010, Munoz et al., 2010, Socher et al., 2011,
Farabet et al., 2013] have demonstrated alternative approaches that achieve equivalent or improved
performance with the additional benefit of a simple, efficient, and modular inference procedure.

Inspired by the hierarchical representation used in the state-of-the-art scene understanding
technique from Munoz et al. [2010], we apply StructuredSpeedBoost to the scene understanding
problem by reasoning over differently sized regions in the scene. In the following, we briefly
review the hierarchical inference machine (HIM) approach from Munoz et al. [2010] and then
describe how we can perform an anytime prediction whose structure is similar in spirit.

7.3.2  Hierarchical  Inference  Machines 

HIM parses the scene using a hierarchy of segmentations, as illustrated in Figure 7.1. By
incorporating multiple different segmentations, this representation addresses the problem of scale
ambiguity in images. Instead of performing (approximate) inference on a large random field
defined over the regions, inference is broken down into a sequence of predictions. As illustrated in
Figure 7.1, a predictor $f$ is associated with each level in the hierarchy that predicts the probability
distribution of classes/objects contained within each region. These predictions are then used by the
subsequent predictor in the next level (in addition to features derived from the image statistics) to
make refined predictions on the finer regions, and the process iterates. By passing class
distributions between predictors, contextual information is modeled even though the segmentation at
any particular level may be incorrect. We note that while Figure 7.1 illustrates a top-down sequence
over the hierarchy, in practice the authors iterate up and down the hierarchy, which we also do in
our comparison experiments.


7.3.3  Speedy  Inference  Machines 

While HIM decomposes the structured prediction problem into an efficient sequence of
predictions, it is not readily suited for anytime prediction. First, the final predictions are generated
when the procedure terminates at the leaf nodes in the hierarchy. Hence, interrupting the procedure
before then would result in final predictions over coarse regions that may severely undersegment
the scene. Second, the amount of computation time at each step of the procedure is invariant to the
current performance. Because the structure of the sequence is predefined, the inference procedure
will predict multiple times on a region as it traverses over the hierarchy, even though there may
be no room for improvement. Third, the input to each predictor in the sequence is a fixed feature
descriptor for the region. Because these input descriptors must be precomputed for all regions in
the hierarchy before the inference process begins, there is a fixed initial computational cost. In
the following, we describe how StructuredSpeedBoost addresses these three problems for
anytime scene understanding.


Scene  Understanding  Objective 

In order to address the first issue, we learn an additive predictor $f$ which predicts a per-pixel
classification for the entire image at once. In contrast to HIM, whose multiple predictors' losses are
measured over regions, we train a single predictor whose loss is measured over pixels. Concretely,
given per-pixel ground truth distributions $p_j \in \mathbb{R}^K$, we wish to optimize the per-pixel
cross-entropy risk for all pixels in the image,

$$\mathcal{R}[f] = \mathbb{E}_x\left[-\sum_j \sum_k p_{jk} \log q(f(x))_{jk}\right], \tag{7.7}$$

where

$$q(y)_{jk} = \frac{\exp(y_{jk})}{\sum_{k'} \exp(y_{jk'})}, \tag{7.8}$$

i.e., the probability of the $k$'th class for the $j$'th pixel. Using Equation (7.2), the probability
distribution associated with each pixel is then dependent on 1) the pixels to update, selected by $h_S$,
and 2) the value of the predictor $h_P$ evaluated on those respective pixels. These functions are
defined in the following subsections.
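For a single pixel, the loss in Equation (7.7) and its negative gradient with respect to the scores can be sketched as below. The function names are hypothetical; the identity that the negative gradient equals $p_j - q(y)_j$ is the standard softmax cross-entropy result and is checked numerically by finite differences.

```python
import math


def softmax(scores):
    m = max(scores)                      # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def pixel_cross_entropy(p, scores):
    """Cross-entropy between ground truth p and softmax(scores)."""
    q = softmax(scores)
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)


def negative_gradient(p, scores):
    """Negative gradient of the loss w.r.t. the scores: p - q(scores)."""
    q = softmax(scores)
    return [pk - qk for pk, qk in zip(p, q)]


p = [0.0, 1.0, 0.0]
scores = [0.2, -1.0, 0.5]
target = negative_gradient(p, scores)

# Central finite differences of the loss, one coordinate at a time.
eps = 1e-5
numeric = []
for k in range(len(scores)):
    up = scores[:k] + [scores[k] + eps] + scores[k + 1:]
    down = scores[:k] + [scores[k] - eps] + scores[k + 1:]
    numeric.append(-(pixel_cross_entropy(p, up)
                     - pixel_cross_entropy(p, down)) / (2 * eps))
```

The vector `target` is the per-pixel regression target that the weak predictors are later trained to match.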

Structure  Selection  and  Prediction 

In order to account for scale ambiguity and structure in the scene, we can similarly integrate
multiple regions into our predictor. By using a hierarchical segmentation of the scene that produces
many segments/regions, we can consider each resulting region or segment of pixels $S$ in the
hierarchy as one possible set of outputs to update. Intuitively, there is no need to update regions of
the image where the predictions are correct at the current inference step. Hence, we want to update
the portion of the scene where the predictions are uncertain, i.e., have high entropy $H$. To achieve
this, we use a selector function that selects regions that have high average per-pixel entropies in
the current predictions,

$$h_S(x, y) = \left\{ S \;\middle|\; \frac{1}{|S|} \sum_{j \in S} H(q(y)_j) > \theta \right\}, \tag{7.9}$$

for some fixed threshold $\theta$. In practice, the set of selection functions $\mathcal{H}_S$ used at
training time is created from a diverse set of thresholds $\theta$.
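A minimal sketch of such an entropy-based selector follows. The region names, pixel distributions, and thresholds are hypothetical values chosen for illustration.

```python
import math


def entropy(q):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(qk * math.log(qk) for qk in q if qk > 0)


def select_regions(regions, q, theta):
    """Return the regions whose average per-pixel entropy exceeds theta.

    regions: dict mapping a region name to its list of pixel indices
    q:       list of per-pixel class distributions
    """
    selected = []
    for name, pixels in regions.items():
        avg_entropy = sum(entropy(q[j]) for j in pixels) / len(pixels)
        if avg_entropy > theta:
            selected.append(name)
    return selected


# Two uncertain pixels followed by two confident ones.
q = [[0.5, 0.5], [0.5, 0.5], [1.0, 0.0], [0.99, 0.01]]
regions = {"A": [0, 1], "B": [2, 3], "C": [0, 1, 2, 3]}

loose = select_regions(regions, q, theta=0.3)
strict = select_regions(regions, q, theta=0.5)
```

Raising the threshold shrinks the selection to the regions that are uncertain throughout, which is why a family of thresholds yields selectors with different coverage.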

Additionally, we assume that the features $\psi_j$ used for each pixel in a given selected region are
drawn from the entire region, so that if a given scale is selected, features corresponding to that scale
are used to update the selected pixels. For a given segment $S$, call this feature vector $\psi_S$.

Given the above selector function, we use Equation (7.3) to find the next best predictor function,
as in Algorithm 7.1, optimizing

$$h_P^* = \arg\min_{h_P} \sum_{S \in h_S(x,y)} \sum_{j \in S} \left\|\nabla(x)_j - h_P(\psi_S)\right\|^2. \tag{7.10}$$

Because all pixels in a given region use the same feature vector, this reduces to the weighted
least squares problem

$$h_P^* = \arg\min_{h_P} \sum_{S \in h_S(x,y)} |S| \left\|\nabla_S - h_P(\psi_S)\right\|^2, \tag{7.11}$$

where $\nabla_S = \mathbb{E}_{j \in S}[\nabla(x)_j] = \mathbb{E}_{j \in S}[p_j - q(y)_j]$. In words, we
find a vector-valued regressor $h_P$ with minimal weighted least squares error between the difference
in ground truth and predicted per-pixel distributions, averaged over each selected region/segment,
and weighted by the size of the selected region. This is an intuitive update that places large weight
on updating large regions.
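The reduction from Equation (7.10) to Equation (7.11) rests on the fact that, for a single region, the per-pixel objective and the size-weighted region objective differ only by a term that does not depend on the prediction. The following sketch checks this numerically on hypothetical gradient values.

```python
def pixel_objective(grads, c):
    """Sum over pixels of ||grad_j - c||^2 for a shared prediction c."""
    return sum(sum((g - ck) ** 2 for g, ck in zip(gj, c)) for gj in grads)


def region_objective(grads, c):
    """|S| * ||mean gradient - c||^2, as in the weighted problem."""
    n = len(grads)
    mean = [sum(col) / n for col in zip(*grads)]
    return n * sum((m - ck) ** 2 for m, ck in zip(mean, c))


# Hypothetical per-pixel targets for one region with three pixels.
grads = [[0.5, -0.5], [0.1, -0.1], [-0.3, 0.3]]
candidates = [[0.0, 0.0], [0.2, -0.4], [1.0, 1.0]]

# The gap is the within-region variance term, identical for every candidate.
gaps = [pixel_objective(grads, c) - region_objective(grads, c)
        for c in candidates]
```

Since the gap is constant in the prediction, both objectives share the same minimizer, and the regression only needs one example per region with weight $|S|$.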

Dynamic  Feature  Computation 

In the scene understanding problem, a significant computational cost during inference is often
feature descriptor computation. To this end, we utilize the SpeedBoost cost model in Equation (7.6)
to automatically select the most computationally efficient features.

The features used in this application, drawn from previous work [Gould et al., 2008, Ladicky,
2011] and detailed in the following section, are computed as follows. First, a set of base feature
descriptors are computed from the input image data. In many applications it is useful to quantize
these base feature descriptors and pool them together to form a set of derived features [Coates
et al., 2011]. We follow the soft vector quantization approach in [Coates et al., 2011] to form a
quantized code vector by computing distances to multiple cluster centers in a dictionary.

Feature       Time (ms)
FH            167
SHAPE         2
TXT (B)       29
TXT (D)       66
LBP (B)       64
LBP (D)       265
I-SIFT (B)    33
I-SIFT (D)    165
C-SIFT (B)    93
C-SIFT (D)    443

Table 7.1: Average timings (ms) for computing all features for an image in the SBD. (B) is the time to
compute the base per-pixel feature responses, and (D) is the time to compute the derived average-pooled
region codes to all cluster centers.

This computation incurs 1) a fixed cost for each group of features with a common base feature,
and 2) an additional, smaller fixed cost for each actual feature used. In order to account for these
costs, we use an additive model similar to Xu et al. [2012] and the budgeted feature selection
application examined in the experimental analysis of the SpeedBoost algorithm, in Section 6.5.3.

Formally, let $\phi \in \Phi$ be the set of features and $\gamma \in \Gamma$ be the set of feature
groups, and let $c_\phi$ and $c_\gamma$ be the cost for computing derived feature $\phi$ and the base
feature for group $\gamma$, respectively. Let $\Phi(f)$ be the set of features used by predictor $f$ and
$\Gamma(f)$ the set of its used groups. Given a current predictor $f_{t-1}$, the group and derived
feature costs of a new predictor $h_P$ are then just the costs of any new groups and derived features
that have not previously been computed:

$$c_\Gamma(h_P) = \sum_{\gamma \in \Gamma(h_P) \setminus \Gamma(f_{t-1})} c_\gamma,$$

$$c_\Phi(h_P) = \sum_{\phi \in \Phi(h_P) \setminus \Phi(f_{t-1})} c_\phi.$$

The total cost model in Equation (7.6) can then be derived using the sum of the feature costs and
group costs as

$$c(h) = c(h_S) + c(h_P) = \epsilon_S + \epsilon_P + c_\Gamma(h_P) + c_\Phi(h_P), \tag{7.12}$$

where $\epsilon_S$ and $\epsilon_P$ are small fixed costs for evaluating a selection and prediction
function, respectively.
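The cost model in Equation (7.12) amounts to two set differences, as the following sketch shows. The base costs for TXT and I-SIFT are taken from Table 7.1, while the per-center derived cost of 0.5 and the fixed evaluation costs of 0.01 are hypothetical values used only for illustration; `predictor_cost` is likewise a hypothetical helper.

```python
def predictor_cost(features_used, seen_features, seen_groups,
                   group_of, c_phi, c_gamma, eps_S, eps_P):
    """Total cost of a weak predictor under the additive cost model."""
    new_features = set(features_used) - seen_features
    new_groups = {group_of[f] for f in features_used} - seen_groups
    return (eps_S + eps_P
            + sum(c_gamma[g] for g in new_groups)
            + sum(c_phi[f] for f in new_features))


# Features are (group, cluster center index) pairs.
group_of = {("TXT", 0): "TXT", ("TXT", 1): "TXT", ("TXT", 2): "TXT",
            ("I-SIFT", 0): "I-SIFT"}
c_gamma = {"TXT": 29.0, "I-SIFT": 33.0}   # base timings (ms) from Table 7.1
c_phi = {f: 0.5 for f in group_of}        # hypothetical per-center cost

seen_features, seen_groups = set(), set()
step_costs = []
for used in [[("TXT", 0), ("TXT", 1)],
             [("TXT", 1), ("TXT", 2), ("I-SIFT", 0)]]:
    step_costs.append(predictor_cost(used, seen_features, seen_groups,
                                     group_of, c_phi, c_gamma, 0.01, 0.01))
    seen_features |= set(used)                    # cache derived features
    seen_groups |= {group_of[f] for f in used}    # cache base descriptors
```

The second predictor reuses the cached TXT base descriptor and one cached TXT code, so it pays only for the new I-SIFT group and the two new codes.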

In order to generate $h_P$ with a variety of costs, we use a modified regression tree that penalizes
each split based on its potential cost, as in [Xu et al., 2012]. This approach augments the
least-squares regression tree impurity function with a cost regularizer:

$$\sum_d w_d \left\|\nabla_d - h_P(\psi_d)\right\|^2 + \lambda \left(c_\Gamma(h_P) + c_\Phi(h_P)\right), \tag{7.13}$$

where $\lambda$ regularizes the cost. In addition to Equation (7.12), training regression trees with
different values of $\lambda$ enables StructuredSpeedBoost to automatically select the most
cost-efficient predictor.
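The effect of the regularizer on a single split decision can be sketched as follows. This is a simplified illustration with scalar targets and a single threshold per feature; it is not the tree learner of Xu et al. [2012], and the data, feature names, and costs are hypothetical.

```python
def weighted_sse(rows):
    """Weighted squared error of (weight, target) rows around their mean."""
    total_w = sum(w for w, _ in rows)
    if total_w == 0:
        return 0.0
    mean = sum(w * t for w, t in rows) / total_w
    return sum(w * (t - mean) ** 2 for w, t in rows)


def penalized_score(data, feature, threshold, new_cost, lam):
    """Split impurity plus lam times the cost of newly required features."""
    left = [(w, t) for w, feats, t in data if feats[feature] <= threshold]
    right = [(w, t) for w, feats, t in data if feats[feature] > threshold]
    return weighted_sse(left) + weighted_sse(right) + lam * new_cost


def best_split(data, candidates, lam):
    """candidates: (feature, threshold, incremental cost) triples."""
    return min(candidates,
               key=lambda c: penalized_score(data, c[0], c[1], c[2], lam))[0]


# Rows are (weight, features, target). "costly" separates the targets
# perfectly; "cheap" is already computed and only partly informative.
data = [
    (1.0, {"cheap": 0.0, "costly": 0.0}, -1.0),
    (1.0, {"cheap": 0.0, "costly": 0.0}, -1.0),
    (1.0, {"cheap": 0.0, "costly": 1.0}, 1.0),
    (1.0, {"cheap": 1.0, "costly": 1.0}, 1.0),
]
candidates = [("cheap", 0.5, 0.0), ("costly", 0.5, 10.0)]

choice_unpenalized = best_split(data, candidates, lam=0.0)
choice_penalized = best_split(data, candidates, lam=1.0)
```

With no penalty the tree splits on the more accurate but expensive feature, while a larger $\lambda$ steers it toward the feature that is already available.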

Stanford Background Dataset

Method                     Per-class recall                                 Class  Pixel
SIM                        …     …     …     …     62.5  76.9  10.5  63.8   69.1   78.8
HIM [Munoz et al., 2010]   …     77.5  89.9  83.6  70.9  83.2  17.2  69.3   73.1   82.1
[Farabet et al., 2013]     95.7  78.7  88.1  89.7  68.7  79.9  44.6  62.3   76.0   81.4
[Socher et al., 2011]      -     -     -     -     -     -     -     -      -      78.1

CamVid

Method                     Per-class recall                                                   Class  Pixel
SIM                        76.8  79.2  94.1  71.8  30.2  95.2  34.6  25.7  14.0  66.3  15.0   54.8   81.5
HIM [Munoz et al., 2010]   83.3  82.2  95.9  75.2  42.2  96.0  38.6  21.5  13.6  72.1  33.3   59.4   84.9
[de Nijs et al., 2012]     59    75    93    84    45    90    53    27    0     55    21     54.7   75.0
[Ladicky et al., 2010]†    81.5  76.6  96.2  78.7  40.2  93.9  43.0  47.6  14.3  81.5  33.9   62.5   83.8

Table 7.2: Recalls on the Stanford Background Dataset (top) and CamVid (bottom) where Class is the
average per-class recall and Pixel is the per-pixel accuracy. †Uses additional training data not
leveraged by other techniques.


7.4  Experimental  Analysis 

7.4.1  Setup 

We compare SIM and HIM on 1) the Stanford Background Dataset (SBD) [Gould et al., 2009], which
contains 8 classes, and 2) the Cambridge Video Dataset (CamVid) [Brostow et al., 2008], which
contains 11 classes; we follow the same training/testing evaluation procedures as originally
described in the respective papers. As shown in Table 7.2, HIM achieves state-of-the-art
performance on these datasets, and we analyze the computational tradeoffs when compared with
SIM. Since both methods operate over a region hierarchy of the scene, we use the same
segmentations, features, and regression trees (weak predictors) for a fair comparison.


Segmentations 

We construct a 7-level segmentation hierarchy by recursively executing the graph-based
segmentation algorithm (FH) [Felzenszwalb and Huttenlocher, 2004] with parameters

$$\sigma = 0.25, \quad c = 10^2 \times [1, 2, 5, 10, 50, 200, 500], \quad k = [30, 50, 50, 100, 100, 200, 300].$$

These values were qualitatively chosen to generate regions at different resolutions.

Features

A region's feature descriptor is composed of 5 feature groups ($\Gamma$): 1) region boundary
shape/geometry/location (SHAPE) [Gould et al., 2008], 2) texture (TXT), 3) local binary patterns
(LBP), 4) SIFT over intensity (I-SIFT), 5) SIFT separately over colors R, G, and B (C-SIFT). The
last 4 are derived from per-pixel descriptors for which we use the publicly available
implementation from [Ladicky, 2011].

Computation times for segmentation and features are shown in Table 7.1; all times were computed
on an Intel i7-2960XM processor. The SHAPE descriptor is computed solely from the segmentation
boundaries and is efficient to compute. The remaining 4 feature group computations are broken
down into the per-pixel descriptor (base) and the average-pooled vector quantized codes (derived),
where each of the 4 groups is quantized separately with a dictionary size of 150 elements/centers
using $k$-means. For a given pixel descriptor $v$, its code assignment to cluster center $\mu_i$ is
derived from its squared $L_2$ distance $d_i(v) = \|v - \mu_i\|_2^2$. Using the soft code assignment
from [Coates et al., 2011], the code is defined as $\max(0, z_i(v))$, where

$$z_i(v) = \mathbb{E}_j[d_j(v)] - d_i(v) \tag{7.14}$$

$$\phantom{z_i(v)} = \mathbb{E}_j[\|\mu_j\|^2] - 2\langle \mathbb{E}_j[\mu_j], v \rangle - \left(\|\mu_i\|^2 - 2\langle \mu_i, v \rangle\right). \tag{7.15}$$

Note that the expectations over centers do not need to be recomputed for each code; hence the $i$'th
code can be computed independently of the other codes, which enables selective computation for the
region. The resulting quantized pixel codes are then averaged within each region. Thus, the cost of
using these derived features depends on whether the pixel descriptor has already been computed.
For example, when the weak learner first uses codes from the I-SIFT group, the cost incurred is the
time to compute the I-SIFT pixel descriptor plus the time to compute distances to each specified
center.
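The equivalence of Equations (7.14) and (7.15) can be checked numerically with the sketch below. The centers and the query descriptor are hypothetical values; the expanded form precomputes the two averages over centers once, so that any single code can then be evaluated from one inner product with its own center.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def sq_dist(v, mu):
    return sum((vi - mi) ** 2 for vi, mi in zip(v, mu))


def codes_direct(v, centers):
    """Equation (7.14): mean distance to all centers minus distance to center i."""
    d = [sq_dist(v, mu) for mu in centers]
    mean_d = sum(d) / len(d)
    return [max(0.0, mean_d - di) for di in d]


def make_expanded_coder(centers):
    """Equation (7.15): precompute the averages over centers once."""
    n = len(centers)
    mean_norm = sum(dot(mu, mu) for mu in centers) / n
    mean_mu = [sum(col) / n for col in zip(*centers)]

    def code(v, i):
        mu = centers[i]
        z = mean_norm - 2 * dot(mean_mu, v) - (dot(mu, mu) - 2 * dot(mu, v))
        return max(0.0, z)

    return code


centers = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]   # hypothetical dictionary
v = [0.9, 0.1]                                   # hypothetical pixel descriptor

direct = codes_direct(v, centers)
code = make_expanded_coder(centers)
expanded = [code(v, i) for i in range(len(centers))]
```

In the expanded form the $\|v\|^2$ terms cancel, which is why a subset of codes can be computed without touching the remaining centers.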


7.4.2  Analysis 

In Figure 7.4 we show which cluster centers, from each of the four groups, are being selected by
SIM as the inference time increases. We note that the efficient SHAPE descriptor is chosen on the
first iteration, followed by the next cheapest descriptors, TXT and I-SIFT. Although LBP is cheaper
than C-SIFT, the algorithm ignored LBP because it did not improve prediction with respect to cost.

In  Figure  7.2,  we  compare  the  classification  performance  of  SIM  and  several  other  algorithms 
with  respect  to  inference  time.  We  consider  HIM  as  well  as  two  variants  which  use  a  limited  set  of 
the  4  feature  groups  (only  TXT  and  TXT  &  I-SIFT);  these  SIM  and  HIM  models  were  executed 
on  the  same  computer.  We  also  compare  to  the  reported  performances  of  other  techniques  and 
stress  that  these  timings  are  reported  from  different  computing  configurations.  The  single  anytime 
predictor  generated  by  our  anytime  structured  prediction  approach  is  competitive  with  all  of  the 
specially  trained,  standalone  models  without  requiring  any  of  the  manual  analysis  necessary  to 
create  the  different  fixed  models. 

In Figure 7.3, we show the progress of the SIM algorithm as it processes a scene from each of
the datasets. Over time, we see the different structural nodes (regions) selected by the algorithm as
well as improving classification.


Figure 7.2: Average pixel classification accuracy for SBD (top) and CamVid (bottom) datasets as a function
of inference time.

Figure 7.3: Sequence of images displaying the inferred labels and selected regions at iterations
t = {1, 5, 15, 50, 100, 225} of the SIM algorithm for a sample image from the Stanford Background
(top) and CamVid (bottom) datasets. The corresponding inference times for these iterations are
{0.42s, 0.44s, 0.47s, 0.79s, 1.07s, 1.63s} (top) and {0.41s, 0.42s, 0.44s, 0.52s, 0.85s, 1.42s} (bottom).

Figure 7.4: The number of cluster centers selected within each feature group by SIM as a function of
inference time.

Chapter 8

Conclusion


8.1  Future  Directions 

In this thesis we propose a framework for anytime prediction and give algorithms for learning
anytime predictors based on functional gradient methods and greedy optimization. Using these two
areas, we give analysis and theoretical guarantees that show that our anytime prediction algorithms
make near-optimal trade-offs between cost and accuracy without knowing the prediction time
constraints a priori. There are a number of areas where we believe this sequential anytime prediction
approach and accompanying analysis can be extended.

8.1.1  Anytime  Representation  Learning 

Xu et al. [2013a] have proposed an anytime prediction approach which learns feature representations which change over time, and then computes predictions using these representations. In a similar fashion to deep network approaches, they learn a set of predictors $\{f_d\}_{d=1}^{D}$ which output a D-dimensional feature representation and then combine this representation with a top layer using some simple linear function of the learned features, such as a Support Vector Machine.
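
To make the structure concrete, the following is a minimal sketch of how a linear top layer might be evaluated over a partially computed representation under a budget. It is an illustration only and not the algorithm of Xu et al. [2013a]: the function name, the per-feature costs, and the rule of stopping at the first feature that no longer fits in the budget are all assumptions made for the example.

```python
from typing import Callable, Sequence, Tuple


def anytime_linear_predict(
    x: Sequence[float],
    feature_fns: Sequence[Callable[[Sequence[float]], float]],  # hypothetical f_1..f_D
    costs: Sequence[float],     # assumed cost of evaluating each f_d
    weights: Sequence[float],   # weights of the linear top layer
    bias: float,
    budget: float,
) -> Tuple[float, float]:
    """Evaluate representation features in order until the budget runs out.

    Features that were never computed contribute nothing to the score,
    which is equivalent to treating them as zero in the linear layer.
    Returns (score, cost actually spent).
    """
    score = bias
    spent = 0.0
    for f, cost, w in zip(feature_fns, costs, weights):
        if spent + cost > budget:
            break  # interrupted: keep the prediction built so far
        score += w * f(x)
        spent += cost
    return score, spent
```

With a larger budget more of the representation is filled in, so the same linear layer yields a refined score without any change to its weights.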

We imagine that a similar approach to anytime representation learning could be derived using SpeedBoost as a base along with our prior work in backpropagated functional gradient techniques [Grubb and Bagnell, 2010] to build a similar network. By learning the layer of representation predictors $f_d$ using a cost-greedy SpeedBoost approach and using functional gradient backpropagation to optimize the complete network, we may be able to improve the cost-accuracy trade-off behavior of the anytime representation learning approach of Xu et al. [2013a], or even present a simpler algorithm for anytime representation learning.

It would be interesting to see how this approach compares to the previous work in anytime representation learning, and whether this anytime representation approach offers any advantages over the direct anytime prediction approach we've presented here. Perhaps the learned representation could be used to enable anytime behavior in some other setting, such as anytime clustering or anytime dimensionality reduction, by using the anytime representation as input.

8.1.2  Parallel  Weak  Predictors 

In  this  document,  we  have  considered  only  the  computational  model  where  weak  predictors  are 
sequentially  applied  to  a  given  input  until  interrupted.  One  alternative  model  is  to  consider  the 
setting  where  weak  predictors  can  freely  be  run  in  parallel.  Consider  the  example  of  a  large-scale 
web service which must compute predictions on a large number of inputs simultaneously. Typically, in such a setting there are a number of other prediction algorithms running remotely that provide weak predictions to the overall web service. In this case selecting a single weak predictor to run is counterproductive, as it is much more efficient to make requests to the other services in parallel.

In  the  parallel  setting,  we  imagine  that  there  are  actually  two  computational  constraints  at  play: 
the  amount  of  parallelism  available  and  the  amount  of  time  or  total  computation  available  to  the 
system.  One  approach  would  be  to  only  consider  weak  predictors  that  are  parallelized  to  take 
advantage  of  any  parallel  resources,  and  revert  to  the  sequential  model  we’ve  outlined  here. 

An approach that would perhaps be more widely applicable is to consider weak predictors that are unparallelized, i.e. single threaded, and derive algorithms which select weak predictors to run in parallel. The anytime algorithms we presented here can naively be parallelized by simply scheduling the sequences of weak predictors selected using some scheduling policy, but it is unclear what kind of guarantees can be derived here, and whether some joint learning of the scheduling policy and the weak predictors to use in the ensemble could perform better.
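
As an illustration of the naive parallelization described above, the sketch below schedules an already learned sequence of weak predictors onto a fixed number of workers using greedy list scheduling (each predictor goes to the worker that becomes free first). The choice of list scheduling, the function names, and the additive combination of outputs are assumptions made for the example; the text above leaves the scheduling policy open.

```python
import heapq
from typing import List, Sequence


def list_schedule(costs: Sequence[float], num_workers: int) -> List[float]:
    """Return the finish time of each weak predictor, in sequence order.

    Predictors are taken in their learned order and each one is assigned
    to whichever worker becomes available earliest.
    """
    # Heap of (time at which the worker is free, worker id).
    free_at = [(0.0, w) for w in range(num_workers)]
    heapq.heapify(free_at)
    finish_times = []
    for cost in costs:
        start, worker = heapq.heappop(free_at)
        end = start + cost
        finish_times.append(end)
        heapq.heappush(free_at, (end, worker))
    return finish_times


def prediction_at(time: float, outputs: Sequence[float],
                  finish_times: Sequence[float]) -> float:
    """Sum the outputs of all weak predictors that have finished by `time`."""
    return sum(o for o, f in zip(outputs, finish_times) if f <= time)
```

Under this policy the prediction available at an interruption time depends on which predictors have completed, not just on how many were started, which is one reason guarantees from the sequential setting may not carry over directly.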

One  other  setting  to  consider  would  be  one  more  similar  to  the  web  service  setting  used  as  an 
illustrative  example.  In  this  setting  all  weak  predictors  can  potentially  be  run  in  parallel,  but  the 
cost  of  evaluating  a  weak  predictor  may  be  dependent  on  what  portion  of  the  data  is  sent  to  it.  In 
this  setting  the  learning  problem  would  be  to  find  a  policy  for  evaluating  different  weak  predictors 
on a given example, where weak predictors can also be evaluated in parallel. For example, it may be best to evaluate a single simple weak predictor on all examples, then, depending on the outcome,
evaluate  many  expensive  weak  predictors  in  parallel.  It  may  be  that  the  same  algorithms  can  be 
used  for  both  of  these  parallel  settings.  Investigating  anytime  prediction  algorithms  for  this  setting 
would  be  interesting  as  well. 

8.1.3  Branching  Predictors 

Many recent approaches to the budgeted prediction problem utilize a branching approach which can compute different weak predictors, or use different actions, on each example based on previous predictions. For example, policy-based approaches [Busa-Fekete et al., 2012, Karayev et al., 2012, He et al., 2013] learn a policy which uses previous predictions to select which actions to take next, and hence can choose to compute different predictors based on the outcome of early predictive actions. Similarly, Gao and Koller [2011] use an approach which conditions actions on previous predictions, and Xu et al. [2013b] give an extension of the Greedy Miser approach for learning a tree of weak predictors instead of a sequence.

Many of these approaches observe a significant decrease in cost for achieving the same accuracy, because individual weak predictors can be targeted to different subsets of the input space. To compare with these approaches and explore possible gains from branching, it would be interesting to consider a branching version of our anytime approach, which learns a tree of weak predictors in a manner similar to Xu et al. [2013b]. To learn their tree-based structure they utilize a global optimization which adjusts all nodes in the tree simultaneously and optimizes final performance. It would be interesting to see if the greedy approach here could be adapted to select a tree of predictors, or if some kind of global optimization of decisions must be done to obtain efficient tree-based performance.

8.1.4  Understanding  Generalization  Properties 

Throughout  this  document  we  have  analyzed  the  predictive  performance  of  our  learned  anytime 
predictors  using  the  training  performance  as  a  metric,  and  the  near-optimality  guarantees  given 
are  statements  about  the  near-optimality  of  cost-greedy  algorithms  with  respect  to  the  optimal 
training  performance.  However,  in  practice,  we  often  see  that  these  cost-greedy  approaches  cause 
an  increase  in  overfitting,  particularly  in  the  domains  where  computation  time  is  dominated  by 
feature  computation  costs,  by  emphasizing  the  maximal  re-use  of  already  computed  features. 

A useful line of work would be to analyze the generalization properties of our anytime approach, and possibly improve the robustness to overfitting by modifying our algorithms. We have developed some methods for doing this in our practical applications, such as evaluating the cost-greedy metric on held-out validation data, but there is still much work to be done in understanding how to handle overfitting in general.
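
The following is a minimal sketch of what evaluating the cost-greedy metric on held-out validation data could look like. It assumes squared loss, additive updates to the current prediction, and candidate weak predictors that have already been fit on the training set; the function names and the candidate tuple layout are hypothetical and chosen only for this example.

```python
from typing import List, Optional, Sequence, Tuple


def mse(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def select_cost_greedy(
    current_val_preds: Sequence[float],
    val_targets: Sequence[float],
    candidates: List[Tuple[str, float, Sequence[float]]],
) -> Tuple[Optional[str], float]:
    """Pick the candidate with the largest validation loss reduction per unit cost.

    Each candidate is (name, cost, outputs on the validation set).
    Returns (None, 0.0) if no candidate reduces the validation loss,
    which can serve as a stopping signal.
    """
    base_loss = mse(current_val_preds, val_targets)
    best_name, best_rate = None, 0.0
    for name, cost, outputs in candidates:
        new_preds = [p + o for p, o in zip(current_val_preds, outputs)]
        rate = (base_loss - mse(new_preds, val_targets)) / cost
        if rate > best_rate:
            best_name, best_rate = name, rate
    return best_name, best_rate
```

Because the gain is measured on data the candidates were not fit to, a predictor that merely re-uses cheap, already computed features to fit training noise should score poorly here even if its training gain per unit cost is high.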